Privacy Issues When Using Robots.txt & The Robots Meta Tag?
Understanding the difference between the robots.txt file
and the Robots <META> tag is critical for search engine optimization and
security. It can also have a profound impact on the privacy of your website and
your customers. The first thing to know is what robots.txt files and Robots
<META> tags are.
Robots.txt
Robots.txt is a file you place in your website's top-level
directory, the same folder in which a static homepage would go. Inside
robots.txt, you can instruct search engines not to crawl content by disallowing
file names or directories. There are two parts to a robots.txt directive: the
user-agent and one or more disallow instructions.
The user-agent names a single Web crawler (spider), or all of
them when set to the * wildcard. When we think of Web crawlers we tend to think
of Google and Bing; however, a spider can come from anywhere, not just search
engines, and there are many of them crawling the Internet.
Here is a simple robots.txt file telling all Web crawlers
that it is okay to spider every page:
User-agent: *
Disallow:
To disallow all search engines from crawling an entire
website, use:
User-agent: *
Disallow: /
The difference is the forward slash after Disallow:,
signifying the root folder and everything in it, including sub-folders and
files.
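To see how a rule-following crawler would read the two files above, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "ExampleBot" and the example.com URL are placeholders invented for illustration; real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The two files shown above, as strings.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical URL and crawler name, used only for this demonstration.
URL = "https://www.example.com/private/report.html"

if __name__ == "__main__":
    print(allowed(ALLOW_ALL, "ExampleBot", URL))  # empty Disallow blocks nothing
    print(allowed(BLOCK_ALL, "ExampleBot", URL))  # "/" blocks the whole site
```

Note that this only models a crawler that chooses to obey the file; nothing in robots.txt can force a spider to comply.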
Is Robots.txt A Security Or Privacy Risk?
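Using robots.txt to hide sensitive or private files is a
security risk. Search engines may still index a disallowed URL, without
crawling its content, when other pages link to it. Worse, robots.txt is publicly
readable, so listing private paths in it is like giving a treasure map to
pirates.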
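The treasure map point is easy to demonstrate: anyone who downloads a robots.txt file can list what its owner tried to hide. The sketch below is a simplified reader, not a full robots.txt parser, and the paths in SAMPLE are made up for illustration.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value from robots.txt text."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file contents; these directory names are invented.
SAMPLE = """User-agent: *
Disallow: /admin/
Disallow: /customer-exports/  # meant to stay private
"""

if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print("Advertised to every visitor:", path)
```

Anything that truly must stay private belongs behind authentication, not in a Disallow line.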
Use Robots <META> Tag To Keep Files Out Of The Search Index
Because robots.txt does not exclude files from the search
indexes, Google and Bing honor a mechanism that does exactly that:
the Robots <META> tag.
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>
The robots <META> tag provides two instructions:
- index or noindex
- follow or nofollow
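Index or noindex instructs search engines whether or not
to index a page. When you select index, they may or may not choose to include a
webpage in the index. If you select noindex, search engines that honor the tag
will not include it, provided they are able to crawl the page. A page that is
also disallowed in robots.txt is never fetched, so the crawler never sees the
tag.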
Follow or nofollow instructs Web crawlers whether or not
to follow the links on a page. Nofollow is like adding a rel="nofollow"
attribute to every link on a page. It evaporates PageRank, the raw search engine
ranking authority passed from page to page via links. Even if you noindex a
page, it is probably a bad idea to nofollow it. Let PageRank flow through to its
final conclusion. Otherwise, you could be pouring perfectly good link juice down
the drain.
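For a rough idea of how a crawler might read these instructions, the following sketch pulls the directives out of a page with Python's standard html.parser module. The class and function names are invented for this example, and real search engines apply more rules than this (for instance, crawler-specific meta names).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set[str]:
    """Return the lowercased robots directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = '<html><head><title>...</title><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head>'

if __name__ == "__main__":
    found = robots_directives(PAGE)
    print("keep out of index:", "noindex" in found)
    print("follow links:", "nofollow" not in found)
```

Checking a handful of your own pages this way is a quick sanity test that the tag you intended is the one actually being served.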
When you want to exclude a page from the search engine
indexes, do this:
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>