Crawling

Crawling is the search engine technique of using web crawlers: programs that follow algorithms to comb web pages and index their data so it can be returned in search results.

Algorithms
Web crawlers operate under a set of rules, implemented as algorithms, that determine how they browse the web for pages to index.

Some algorithm techniques for web crawlers include:

 * Selection policy - The web is far too large for a single crawler to comb completely, so a selection policy lets a crawler prioritize which pages to visit, for example by focusing on specific keywords.
 * Restricting followed links - When mapping the web, a crawler may only want to index HTML pages, which it does by skipping links to non-HTML resources.
 * URL normalization - Crawlers standardize URLs (for example, lowercasing the host and resolving redundant path segments) so that equivalent URLs map to a single canonical form and the same page is not crawled more than once, allowing for better mapping of a website's pages.
 * Path ascending - This technique has a crawler ascend through every level of a URL's path to find isolated resources that would not have been found otherwise.
 * Academic - Academic crawlers focus on file formats such as PDF, PostScript, and Microsoft Word. This allows the crawler to index academic papers that have been posted online.
 * Re-visit policy - A re-visit policy is often implemented because the web is dynamic and ever-changing. Data from a crawler's last pass through a website can become stale, so an algorithm can be written to determine whether the crawler needs to traverse a website again, based on the probability that it has changed or has new information.
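
The URL normalization technique above can be sketched in a few lines. This is a minimal illustration, not a complete normalizer: it assumes lowercasing the scheme and host, dropping default ports and fragments, and treating an empty path as "/" are the rules in use.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Map equivalent URL spellings to one canonical string (minimal sketch)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep only non-default ports (80 for http, 443 for https).
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # Fragments never reach the server, so discard them.
    return urlunsplit((scheme, host, path, parts.query, ""))

print(normalize_url("HTTP://Example.COM:80/index.html#top"))
# http://example.com/index.html
```

A crawler would normalize every discovered URL before checking it against the set of already-visited pages, so the same page is never fetched twice under different spellings.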
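
Path ascending can likewise be sketched briefly. Assuming a crawler wants to visit every ancestor directory of a discovered page, it can generate those URLs directly from the path:

```python
from urllib.parse import urlsplit, urlunsplit

def ascend_paths(url):
    """Yield each ancestor directory of a URL, deepest first.
    e.g. /a/b/c.html -> /a/b/, /a/, /"""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i in range(len(segments) - 1, 0, -1):
        ancestor = "/".join(segments[:i]) + "/"
        yield urlunsplit((parts.scheme, parts.netloc, ancestor, "", ""))

for u in ascend_paths("http://example.com/a/b/c.html"):
    print(u)
```

Crawling these ancestor URLs can surface index pages or files that no other page links to.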
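
One simple way a re-visit policy could work, assuming the crawler records whether a page changed between visits, is to adjust the revisit interval adaptively: visit pages that change frequently more often, and back off on stable ones. The interval bounds below are illustrative assumptions, not standard values.

```python
def next_revisit_interval(last_interval, changed,
                          min_interval=3600,          # assumed floor: 1 hour
                          max_interval=30 * 24 * 3600):  # assumed cap: 30 days
    """Halve the interval if the page changed since the last visit,
    double it if not, clamped to [min_interval, max_interval]."""
    interval = last_interval / 2 if changed else last_interval * 2
    return max(min_interval, min(max_interval, interval))

# A frequently-changing page gets revisited sooner each time:
print(next_revisit_interval(7200, changed=True))   # 3600.0
# A stable page is revisited less and less often:
print(next_revisit_interval(7200, changed=False))  # 14400
```

This multiplicative scheme is only one possibility; production crawlers often estimate change probability statistically from each page's observed update history.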