Article: Making Your Pages Easy for Search Engines to Index
Author: Andrew Starling
Date: February 4, 2002
Voodoo
First, let's dispense with the
zombies and headless chickens. Many companies and experts offer methods of
deceiving search engine spiders to gain high rankings. And a few of these
methods even work — at least for a while.
On the other side of the front line are the search engines themselves,
engaged in a perpetual battle to identify these techniques, commonly known
as spamming, and penalize the sites that use them because they distort
search results. And of course the search engines want the sole right to do
that themselves, through accepting payments for high rankings, but that's
another story.
Often the techniques used by spamming experts backfire — they're spotted
and punished by low rankings. Even more often, the same techniques are
attempted by regular webmasters, who are not dedicated search engine
experts and can't keep up with the progress of the battle, so they get
wiped out soon after the opening credits. As a small example, just
renaming your pages with spider-friendly filenames, without changing their
content, can get them demoted in the rankings.
The easiest way to avoid this virtual war is to step aside and provide the
search engines with what they want and observe their rules. Few sites
manage to do this, and yet it's both effective and "legal".
The Time Element
There will be no instant
results. Getting a good search engine ranking takes time. That's one of
the best ways to tell charlatans from real experts. Anybody who
promises instant gratification is in the voodoo business.
Search engines do not publish their precise methods of working, but most
anecdotal evidence points to the importance of time. They simply do not
trust new sites to deliver the goods. Also the engines are oversubscribed
and have too many sites to index, so they're prone to dropping new sites
from their listings on the grounds that many young sites are destined not
to reach maturity, so it would be silly to take them too seriously while
they're still in diapers.
You need to allow six months to a year for a decent search engine strategy
to work. It will then continue to work for a long time, with minimal
effort. But in the first few months it might not be a star performer. If
your site doesn't provide what visitors want to see then it might never
perform at all, but once again, that's a different story.
During the first year of promoting a new site, you may have to resubmit
your site a number of times because it's repeatedly dropped by the search
engines. That's fine. It's part of the game. They're just checking that
you're serious.
You may also experience long delays between submission and actual listing.
Look at the details provided by the search engine when you make your
submission — often they'll tell you how long the delay will be. Allow a
few weeks extra, to be on the safe side, before resubmitting.
All your submissions should be done manually, and it's a good idea to keep
a record of them so you don't bug specific engines too often. If you do,
you will be penalized. Automatic submission systems ("With one keystroke
register your site with 1,500 search engines!") are for suckers. The
search engines quickly identify them and ignore them, or, worse still,
punish the sites that use them.
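The record-keeping described above can be as simple as a dictionary of
engine names and submission dates. The following is only a minimal sketch:
the function name, the engine name in the usage note, and the 21-day safety
margin (one reading of "a few weeks extra") are assumptions made for
illustration, not values published by any search engine.

```python
from datetime import date, timedelta


def can_resubmit(log, engine, today, stated_delay_days, safety_margin_days=21):
    """Return True if it looks safe to resubmit a site to `engine`.

    log               -- dict mapping engine name to the date of the last submission
    stated_delay_days -- the listing delay the engine quoted at submission time
    safety_margin_days -- extra waiting time (assumed default: three weeks)
    """
    last_submitted = log.get(engine)
    if last_submitted is None:
        # No record of an earlier submission, so nothing to wait for.
        return True
    earliest = last_submitted + timedelta(days=stated_delay_days + safety_margin_days)
    return today >= earliest
```

For example, with a log of `{"ExampleSearch": date(2002, 2, 4)}` and a
quoted delay of 28 days, the function keeps returning False until seven
weeks after the submission date.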
Spiders are Machines
Spiders (or robots) are
software programs the search engine companies create to trawl the Web and
index sites. They create massive databases that the engines then use to
return search results.
They follow rules of logic, impeccably, and have no flexibility. They have
no idea what your site really looks like, nor do they have a sense of
humor. It's highly unlikely that a real human will look at your site as
part of the indexing process. The exception is when you apply for a
listing with a directory such as Yahoo!.
When designing your site, it's important to remember that it will be read
by machines. This means, for example, if you turn all your major page
headings into graphics, the spiders won't be able to recognize where your
main headings are and identify the core text areas that follow, even though
this would pose no problem for a human viewer.
On the other hand, if you go all out for machine-readability, you may well
get the thumbs-down from a Yahoo! reviewer, who has no interest in how your
pages appear to a machine. It's all a question of balance.
Text
Search engine spiders want to
read the text on your pages, and especially the introductory text near the
top of the page. This mirrors the way human beings assess pages — by
reading them, starting at the top.
Here are some guidelines to keep text-hungry spiders happy:
One: Provide text. Pages without text rarely gain high
rankings. This is especially important for home pages. If there's no text
on the opening page then the spider might stop right there and not even
bother to look at the rest of your site. It's one reason for avoiding
splash pages at the front end. Ideally you should provide at least 150
words of text on your home page.
Two: Make
full use of early paragraphs to include relevant keywords. Most search
engines place emphasis on early text, and less on the words further down
the page. The numbers vary from engine to engine, but you can assume the
first 50 words are crucial, the next 50 are important, the 50 following
are likely to be read. After that, it's anybody's guess, though some
engines do manage to fully index pages with more than a thousand words.
Try to get your important keywords — the expressions you expect your
visitors to use in their searches — included in your first 150.
Three: Don't
overdo repetition. If you repeat your keywords too often, you could be
penalized. There's no magic number to aim for, but if you repeat each
keyword no more than three times, you should be safe.
Four: Concentrate on the main text. You might have a separate top table (perhaps
containing an advert and logo) plus a left-hand column with links. These
will appear in the HTML file before your main, central text block. There's
a temptation to think these areas are more important than the main text
area because spiders read them first. If these outlying areas contain a
lot of text (unlinked) then this may well be true. But many engines try to
ignore peripheral HTML blocks, especially if they're heavy on links, and
head straight for the center. It's not too difficult for them to do. They
simply look for the largest title on the page (the highest-level heading
tag, such as <h1>) and assume that whatever follows it is the most
important text area.
Five: It's
not much use getting your keywords in the right place if you've chosen the
wrong ones. It doesn't help the spiders either. They'd prefer you to
choose the right keywords so their indexing works as intended. It's worth
spending a few hours on deciding your keywords, maybe trying out a few
expressions in the search engines and seeing if they deliver the sites you
want to compete with.
Six: Spiders
have lists of stop words — mainly related to adult content and profanity.
When they find one of these words they may abandon your site altogether.
If you have a page that includes a possible stop word, hide it from
spiders by making it an exclusion in your robots.txt file (see later).
Also watch out for words that have two meanings, one of which is sexual.
Spiders don't understand context.
Seven: If you
have pages full of links, make sure there's plenty of text to accompany
them. Pure link listings are often ignored by spiders, but if you add a
couple of sentences describing each link, the problem disappears.
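Guidelines Two and Three lend themselves to a quick self-check before you
submit a page. The sketch below is illustrative only: the function name is
invented, it handles single-word keywords only, and the 150-word window and
three-repeat limit are simply the rules of thumb from the text, not figures
published by any search engine.

```python
import re
from collections import Counter


def early_keyword_report(text, keywords, window=150, max_repeats=3):
    """Check keyword placement and repetition in a page's visible text.

    For each single-word keyword, report whether it appears in the first
    `window` words and whether it occurs more than `max_repeats` times
    in the whole text.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    early_words = set(words[:window])
    totals = Counter(words)
    report = {}
    for keyword in keywords:
        key = keyword.lower()
        report[keyword] = {
            "in_early_text": key in early_words,
            "total_count": totals[key],
            "overused": totals[key] > max_repeats,
        }
    return report
```

Feed it the plain text of a page (tags stripped) and the keywords you chose
under guideline Five. A keyword that is missing from the early text, or
flagged as overused, is a candidate for a rewrite.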
Popular Sites are Exceptions
Often you can learn a few
tricks by looking at the most popular sites on the Web and seeing how they
do things. But not in this case. The most popular sites are given a
special status by search engines and indexed under slightly different
rules than regular sites. They are more likely to be indexed thoroughly
and frequently, which means they don't have to try as hard. Also, because
it's assumed they won't try to spam the engines, they're forgiven the
occasional mistake, such as overusing a keyword.
Titles and Filenames Count
Spiders like to see useful
page titles, and some also appreciate relevant filenames. It helps them,
but unfortunately the mechanism has been abused, so they're wary. Try to
use filenames and page titles that match your text content and keywords,
rather than using them to cover keywords that don't otherwise get a
mention. Words in filenames can be separated by an underscore — this is a
convention that IT professionals used before the Internet arrived, so it's
perfectly acceptable. But if your filenames turn into a long sequence of
keywords, spiders will assume you're trying to spam them.
Meta Tags
These go in the <head> section of the HTML file,
in two tags: keywords and description. The meta tag system has been
so heavily abused that some engines simply ignore them. But it's still
worth spending a few minutes on creating them for the engines that remain
interested. Keep them short and don't use words that are missing from the
main text. If you spend a long time working on meta tags, you're probably
trying to manipulate the system and you may well be found out. Create them
quickly, using the simplest, most obvious content, and it's more likely
they'll work as intended.
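The advice not to use words that are missing from the main text can be
checked mechanically. The following sketch uses Python's standard
html.parser module. The class and function names, and the comma-separated
keyword format it expects, are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class MetaAndTextParser(HTMLParser):
    """Collect <meta name=... content=...> values and the text inside <body>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.text_parts = []
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.text_parts.append(data)


def keywords_missing_from_text(html):
    """Return meta keywords (or phrases) whose words do not all occur in the body text."""
    parser = MetaAndTextParser()
    parser.feed(html)
    body_words = set(re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower()))
    raw = parser.meta.get("keywords", "")
    keywords = [k.strip().lower() for k in raw.split(",") if k.strip()]
    return [k for k in keywords if not all(w in body_words for w in k.split())]


# Hypothetical page used only to exercise the check.
SAMPLE_PAGE = """<html><head>
<title>Handmade Oak Furniture</title>
<meta name="keywords" content="oak furniture, handmade tables, cheap flights">
<meta name="description" content="Handmade oak furniture built to order.">
</head><body>
<h1>Handmade Oak Furniture</h1>
<p>We build handmade oak tables and other furniture to order.</p>
</body></html>"""
```

Calling `keywords_missing_from_text(SAMPLE_PAGE)` flags "cheap flights",
the one keyword phrase that the page text never mentions.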
Image Alt Descriptions
These create the text that
shows in an image space before a graphic loads, and subsequently when the
mouse rolls over it. They've been sorely abused, often crammed with long
lists of keywords, and again the spiders have wised up and tend to ignore
them, or penalize obvious abuse.
Their proper use is to show visitors with text-only browsers (and
impaired-vision visitors with talking browsers) what they're missing.
Using them as a method of presenting keywords is spamming and you can
hardly complain if it gets you a ranking penalty.
Frames
Frames confuse most spiders.
If you insist on using frames, then make the most of your <noframes> tag
and include a link within it to a sitemap or contents page that lists your
pages and links to them directly, rather than linking to framesets. You
can always force the framesets to appear when the links are followed in a
regular browser by using JavaScript, which the spiders will ignore. It's a
lot of work but at least it should get you listed in the search engines.
Robots.txt
This text file goes in your
root directory and gives instructions to spiders about which files and
directories to ignore when they're trawling your site. It can have other
uses too, but many of these are close to spamming techniques so won't be
covered here.
Here's a sample robots.txt file:
User-Agent: *
Disallow: /images/
Disallow: /bookmark*.html
Disallow: /cgi_bin/
Disallow: /status/
This tells all spiders (first
line) not to look inside the directories called images, cgi_bin and
status, and to ignore files called bookmark1.html, bookmark2.html and so
on. Incidentally, the linebreaks are important.
It's a good idea to include a
robots.txt file on your site, even if you don't have much to exclude. It
helps prevent spiders wasting their time poking around in your image
directories. And since spiders often tire and give up with sites without
fully indexing them (especially new sites) it can help you get the more
important areas of your site indexed.
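If you want to see how a set of exclusion rules will be read before you
upload them, Python's standard urllib.robotparser module can evaluate them
locally. This is a sketch rather than a statement of how any particular
spider behaves. The helper names are invented, and the bookmark rule is
written here as a plain path prefix (/bookmark) because this parser matches
rules by prefix and does not expand a * inside a path.

```python
from urllib.robotparser import RobotFileParser

# Rules for illustration; one directive per line, as the format requires.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /images/
Disallow: /bookmark
Disallow: /cgi_bin/
Disallow: /status/
"""


def build_parser(robots_text):
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, path, agent="*"):
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, path)
```

With these rules, `is_allowed(build_parser(ROBOTS_TXT), "/index.html")` is
True, while paths under /images/, /cgi_bin/ and /status/, and any path
beginning with /bookmark, are refused.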
Directory Structure
Spiders find their way around
your site by following your internal links. They prioritize pages that are
in the root directory, then first level directories, and if you're lucky
(or a very popular site) they may look at subdirectories beyond that, but
often they won't bother. That's why you find most professional sites have
a flat structure, with many pages in the root directory and first-level
subdirectories, rather than a deep structure with many levels of
subdirectories.
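To find out how flat your own site is, you can measure how many directory
levels sit between the root and each page. The sketch below is one way to
do it. The function names and the default limit of one level are
assumptions drawn from the rule of thumb above, not a threshold documented
by any search engine.

```python
from urllib.parse import urlparse


def directory_depth(url):
    """Count the directory levels between the site root and the file.

    /index.html            -> 0 (root directory)
    /products/chairs.html  -> 1 (first-level directory)
    """
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return 0
    # A trailing slash means every segment is a directory;
    # otherwise the last segment is the file name.
    if path.endswith("/"):
        return len(segments)
    return len(segments) - 1


def pages_deeper_than(urls, max_depth=1):
    """Return the URLs that sit below `max_depth` directory levels."""
    return [u for u in urls if directory_depth(u) > max_depth]
```

Run `pages_deeper_than` over a list of your page URLs. Anything it returns
is a page that a spider may never reach, and a candidate for moving closer
to the root.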
Dynamic Pages
Spiders generally have trouble
with these. Also they're a little frightened of them because they can get
trapped inside a dynamic page server, and may even bring the server down.
For this reason spiders identify dynamic pages by the question mark
contained in their URLs, and usually avoid them. Some will allow you to
submit specific dynamic pages, but they still won't follow the internal
links within them.
One solution is to create static gateway pages that include static links
to other pages on your site.
Make sure the link URLs are
inherently complete, not generated on the fly, that they don't contain
question marks, and that your server can translate these static links to
reach dynamic pages if it has to. Also make sure there's plenty of text on
the gateway page, that it isn't purely made up of links, otherwise it may
be ignored.
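The translation step mentioned above, from a static-looking link to the
dynamic page behind it, might look something like the following. This is a
sketch under stated assumptions: the /catalog/ URL scheme, the catalog.cgi
script name and the parameter names are all hypothetical, and a real site
would normally do this mapping in its web server's rewrite rules rather
than in application code.

```python
import re
from urllib.parse import urlencode

# Hypothetical static URL scheme: /catalog/<category>/<item_id>.html
STATIC_PATTERN = re.compile(
    r"^/catalog/(?P<category>[a-z0-9-]+)/(?P<item_id>\d+)\.html$"
)


def static_to_dynamic(path, script="/catalog.cgi"):
    """Translate a static-looking path into the dynamic URL the server runs.

    Returns None when the path does not follow the static scheme.
    """
    match = STATIC_PATTERN.match(path)
    if match is None:
        return None
    return script + "?" + urlencode(match.groupdict())


def dynamic_to_static(category, item_id):
    """Build the complete, question-mark-free link to place on a gateway page."""
    return f"/catalog/{category}/{item_id}.html"
```

The gateway page carries only the links produced by `dynamic_to_static`, so
a spider sees complete URLs with no query string, while the server uses
`static_to_dynamic` (or its equivalent) to find the real page.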
An alternative is to make
technical alterations to your system so the server can cope with a visit
from a spider, and then replace the question mark with a less obvious
symbol such as a % sign. There's no point in making this replacement if
the server won't be able to cope. The usual problem is that links to
dynamic pages are often created dynamically themselves, and spiders can't
manage this. They request pages with incomplete URLs missing query string
elements, the server sends back a request for more information to complete
the URL, which the spider can't understand, and the request turns into a
dangerous loop. To get over this you have to create a work-around for the
incomplete URL problem, and technically that's a demanding task.
For more details on getting
dynamic sites indexed, try NetMechanic.