
Anatomy of a search engine crawler




By Rob Sullivan

Many people who perform a search on a search engine don't understand how the results end up there. Some think that sites are submitted, while others know that a piece of software finds the pages. This article explains one piece of that puzzle: the search engine crawler.

Today's search engines rely on software packages called spiders or robots. These automated tools search the web to discover new pages.

A brief history of search crawlers

The first crawler was the World Wide Web Wanderer, which appeared in 1993. It was developed at MIT, and its initial purpose was to measure the growth of the web. Soon after, however, an index was generated from the results, effectively the first "search engine."

Since then, crawlers have evolved. Initially crawlers were simple creatures, only able to index specific bits of web page data such as meta tags. Soon, however, search engines realized that a truly effective crawler needs to be able to index other information, including visible text, alt text, images and even non-HTML content such as PDFs, word processor documents and more.

How a crawler works

Generally, the crawler gets a list of URLs to visit and store. The crawler doesn't rank the pages; it only goes out and gets copies, which it stores or forwards to the search engine to be indexed and ranked later according to various criteria.

Search crawlers are also smart enough to follow links they find on pages. They may follow these links as they find them, or they may store them and visit them later.
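The fetch, store and follow-links loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine's crawler is built: the `crawl` function, the `fetch` callback and the in-memory `FAKE_WEB` pages are all made up for the example, and a dictionary stands in for real HTTP requests so the sketch runs anywhere.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: fetch each URL, store a copy, queue new links.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns the page HTML, or None if the page cannot be retrieved.
    """
    frontier = deque(seed_urls)  # URLs waiting to be visited
    seen = set(seed_urls)        # URLs already queued, to avoid repeats
    store = {}                   # URL -> stored copy of the page

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html  # no ranking here; indexing happens later

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store


# A tiny in-memory "web" (sample data) stands in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/news.html": '<a href="/about.html#team">Team</a>',
}

pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

Here the links are stored in a queue and visited later, which is the second of the two strategies mentioned above. Following each link as soon as it is found would mean using a stack instead of a queue.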

Today there are dozens of crawlers regularly indexing the web. Some are specialized crawlers, such as image indexers, while others are more general and therefore better known.

Some of the best-known crawlers include Googlebot (from Google), MSNBot (from MSN) and Slurp (from Yahoo!). There is also the Teoma crawler (from Ask Jeeves), as well as an assortment of crawlers from other engines, such as shopping engines, blog search engines and more.

Generally, when a crawler comes to visit a site, it first requests a file called "robots.txt." This file tells the search crawler which files it can request, and which files or directories it is not allowed to visit.

The file can also be used to limit specific spiders' access to any or all of the site, and to control how often the crawler visits the site by limiting its speed or the times when the crawler can visit. (Yahoo!'s Slurp and MSNBot both support the "Crawl-delay" directive, which tells the crawlers to slow down their crawling.)

It's not imperative that a site have a robots.txt file, however, as a crawler will assume it is OK to index the site if there isn't such a file.
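To make the robots.txt rules above concrete, here is a small sketch that uses Python's standard `urllib.robotparser` module to read a robots.txt file the way a well-behaved crawler might. The file contents, the paths and the 10-second delay are invented for the example; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt (made up for illustration).
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Slurp has its own section: /private/ is off limits, the rest is allowed.
print(parser.can_fetch("Slurp", "http://example.com/private/report.html"))  # False
print(parser.can_fetch("Slurp", "http://example.com/index.html"))           # True

# Googlebot is not named, so the "*" section applies to it.
print(parser.can_fetch("Googlebot", "http://example.com/cgi-bin/search"))   # False

# Crawl-delay is reported in seconds, or None when no delay is set.
print(parser.crawl_delay("Slurp"))      # 10
print(parser.crawl_delay("Googlebot"))  # None
```

Note that this only shows how the rules are read. Whether a given crawler honors a directive such as Crawl-delay is up to that crawler.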

Generally, today's crawlers behave like stripped-down versions of web browsers. Some, like Googlebot, see a page much as a text-based web browser such as Lynx would. Therefore, one of the tools you can use to verify a site is the Lynx browser. By loading the site in the browser you can see essentially what the crawler "sees." You can then look for errors in the pages as well as any navigation problems the crawler may come up against.
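If Lynx isn't handy, a rough approximation of that text-only view can be sketched with Python's standard `html.parser` module. This is not Lynx and not any crawler's actual parser; the `TextView` class, the `text_view` helper and the sample page are hypothetical, and the sketch simply keeps visible text and image alt text while skipping scripts and styles.

```python
from html.parser import HTMLParser


class TextView(HTMLParser):
    """Keeps visible text and image alt text; skips scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            # An image with no alt text gives a text-only visitor nothing.
            self.parts.append(f"[{alt}]" if alt else "[image]")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(" ".join(data.split()))


def text_view(html):
    """Return the page roughly as a text-only visitor would read it."""
    viewer = TextView()
    viewer.feed(html)
    viewer.close()
    return "\n".join(viewer.parts)


# Sample page (made up for illustration).
SAMPLE = """
<html><head><title>Acme Widgets</title>
<script>var menu = buildFlashMenu();</script></head>
<body><img src="logo.gif" alt="Acme logo">
<h1>Welcome</h1><p>We sell <b>widgets</b>.</p>
<img src="spacer.gif"></body></html>
"""

print(text_view(SAMPLE))
```

In the output, the script-driven menu disappears entirely and the image without alt text shows up only as a placeholder, which is the kind of navigation problem worth looking for.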

One other thing you may notice, as you view your web server log reports, is that some crawlers visit many different times and with many different configurations.

Yahoo!'s Slurp, for example, emulates many different operating systems, from Windows 98 to Windows XP, and many different browsers, from Internet Explorer to Mozilla. MSNBot also works like this, emulating different operating systems and browsers.

They do this to ensure compatibility. After all, the search engines want to be sure that the majority of their users find a site which they can use. Therefore, as a design tip, you should test your site against various platforms and browsers as well. You don't have to use the variety that the search engines use, but you should test against Internet Explorer, Netscape and Firefox. Also, you should try your site on other platforms such as a Mac or Linux just to ensure compatibility.

You may also notice, upon reviewing your reports, that crawlers like Googlebot will visit often and request the same page(s) repeatedly. This is common, as crawlers want to be sure the site is stable and to measure how often each page changes.
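To spot these repeat visits yourself, you can count crawler requests per page straight from the raw access log. The sketch below assumes the log uses the common "combined" format and identifies crawlers by a simple substring match on the user-agent string; the `crawler_hits` function, the IP addresses and the log lines are sample data invented for the example.

```python
import re
from collections import Counter

# One line of a "combined" format access log (assumed format).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings to look for in the user-agent field.
CRAWLERS = ("Googlebot", "msnbot", "Slurp")


def crawler_hits(lines):
    """Count requests per (crawler, path) pair; ignore everything else."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a combined-format line
        agent = match.group("agent").lower()
        for name in CRAWLERS:
            if name.lower() in agent:
                hits[(name, match.group("path"))] += 1
                break
    return hits


# Sample log lines (made up for illustration).
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [10/Oct/2005:17:02:11 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.20 - - [10/Oct/2005:14:01:09 -0700] "GET /about.html HTTP/1.0" 200 1500 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.30 - - [10/Oct/2005:14:05:42 -0700] "GET /index.html HTTP/1.1" 200 2326 '
    '"http://example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

hits = crawler_hits(SAMPLE_LOG)
for (crawler, path), count in sorted(hits.items()):
    print(crawler, path, count)
```

Keep in mind that a user-agent string can be faked, so a substring match like this is only a first pass.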

If your site goes down temporarily when a crawler visits repeatedly like this, don't worry. The crawlers are smart enough to leave and come back later to try again. If, however, they continue to find the site down or slow to respond, they may opt to stay away for longer periods, or index the site more slowly. This can negatively impact your site's performance in the search engines.

As time goes on, we'd expect these spiders to become even more advanced. As new authoring technologies or new indexing options become available, the search crawlers will be adapted to handle them. Remember, the goal of all the search engines is to have the most complete index of files found on the web. This means they want to be able to index more than just web pages.

So as you are designing your site, be sure to keep the crawlers in mind. Don't build your site for crawlers; build it for users. But be sure to test it thoroughly so that the crawlers see what you want them to without hindrances or roadblocks. Remember: the crawler is a site owner's best friend.
 
 
About the Author
Rob Sullivan - SEO Specialist and Internet Marketing Consultant. Any reproduction of this article needs to have an html link pointing to http://www.textlinkbrokers.com


Article Source: http://www.simplysearch4it.com/article/10889.html
 


