Crawling is a search engine technique that uses web crawlers. These crawlers follow algorithms to comb web pages and index their data so it can be returned in search results.

Algorithms

Web crawlers operate under a set of rules, implemented as algorithms, that determine how they browse the web for pages to index.
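
As an illustration of such a rule set, the following is a minimal Python sketch of a crawl loop, not the code of any particular search engine: a frontier of URLs is visited one at a time, each page that can be fetched is indexed, and newly discovered links are queued for later visits.

  # Minimal sketch of a rule-driven crawl loop (illustrative only, standard library).
  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin
  from urllib.request import urlopen

  class LinkParser(HTMLParser):
      """Collects href values from <a> tags on a page."""
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(seed_urls, max_pages=100):
      frontier = deque(seed_urls)   # URLs waiting to be visited
      seen = set(seed_urls)         # never queue the same URL twice
      index = {}                    # url -> raw page text
      while frontier and len(index) < max_pages:
          url = frontier.popleft()
          try:
              page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
          except (OSError, ValueError):
              continue              # unreachable or unsupported URL: skip it
          index[url] = page
          parser = LinkParser()
          parser.feed(page)
          for link in parser.links:
              absolute = urljoin(url, link)
              if absolute not in seen:
                  seen.add(absolute)
                  frontier.append(absolute)
      return index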

Some algorithm techniques for web crawlers include:

  • Selection policy - The web is too large for a single crawler to comb through in its entirety, so a selection policy lets the crawler prioritize pages that match specific keywords (sketch below).[1]
  • Restricting followed links - When mapping the web, a crawler may only want to index HTML pages, so it restricts the links it follows by skipping those that point to non-HTML resources (sketch below).
  • URL normalization - Crawlers convert URLs into a single canonical form so that different spellings of the same address are not crawled twice, allowing for better mapping of a website's pages (sketch below).[2]
  • Path ascending - This technique forces a crawler to ascend through every path in a URL it intends to crawl, so it can find isolated resources that would not have been discovered otherwise (sketch below).[3]
  • Academic - Academic crawlers focus on file types such as PDF, PostScript, and Word documents, which allows them to index academic papers that have been posted online (sketch below).[4]
  • Re-visit policy - A re-visit policy is often implemented because the web is dynamic and ever-changing, so data gathered on an earlier pass can become inaccurate; an algorithm can decide whether a crawler needs to traverse a website again based on the probability that it has changed or has new information (sketch below).[5]
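
The sketches below illustrate each of the techniques above in Python. They are simplified examples built on assumed rules, keyword sets, and helper names, not the implementations of any particular crawler. First, a keyword-based selection policy: a priority frontier that pops the URL matching the most target keywords first.

  # Keyword-based selection policy sketch; the keyword set is made up for the example.
  import heapq

  KEYWORDS = {"crawler", "search", "index"}

  def keyword_score(url):
      """Count how many target keywords appear in the URL string."""
      return sum(1 for kw in KEYWORDS if kw in url.lower())

  class PriorityFrontier:
      def __init__(self):
          self._heap = []
          self._count = 0  # insertion order breaks ties between equal scores

      def push(self, url):
          # heapq is a min-heap, so the score is negated to pop the best URL first.
          heapq.heappush(self._heap, (-keyword_score(url), self._count, url))
          self._count += 1

      def pop(self):
          return heapq.heappop(self._heap)[2]

  frontier = PriorityFrontier()
  frontier.push("http://example.com/about")
  frontier.push("http://example.com/search/index")
  assert frontier.pop() == "http://example.com/search/index"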
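
Restricting followed links can be sketched as a simple extension filter; real crawlers may also check the Content-Type header of the response before parsing it.

  # Link restriction sketch: skip links whose extension suggests a non-HTML resource.
  from urllib.parse import urlparse

  NON_HTML_EXTENSIONS = (".jpg", ".png", ".gif", ".zip", ".exe", ".mp4", ".css", ".js")

  def is_probably_html(url):
      return not urlparse(url).path.lower().endswith(NON_HTML_EXTENSIONS)

  links = ["http://example.com/page.html", "http://example.com/photo.jpg"]
  followed = [u for u in links if is_probably_html(u)]
  assert followed == ["http://example.com/page.html"]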
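
URL normalization can be sketched as a canonicalization function; the exact rules below (lowercase the scheme and host, drop default ports, fragments, and trailing slashes) are assumptions chosen for the example.

  # URL normalization sketch: map different spellings of an address to one form.
  from urllib.parse import urlsplit, urlunsplit

  def normalize(url):
      parts = urlsplit(url)
      host = (parts.hostname or "").lower()
      if parts.port and parts.port not in (80, 443):
          host = "%s:%d" % (host, parts.port)
      path = parts.path or "/"
      if path != "/" and path.endswith("/"):
          path = path.rstrip("/")
      return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

  assert normalize("HTTP://Example.COM:80/docs/") == normalize("http://example.com/docs")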
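
Path ascending can be sketched by generating every ancestor path of a discovered URL and queueing those paths as well.

  # Path ascending sketch: from one URL, derive its ancestor paths up to the root.
  from urllib.parse import urlsplit, urlunsplit

  def ascend_paths(url):
      parts = urlsplit(url)
      segments = [s for s in parts.path.split("/") if s]
      urls = []
      for depth in range(len(segments), -1, -1):   # full path down to the site root
          path = "/" + "/".join(segments[:depth])
          urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
      return urls

  assert ascend_paths("http://example.com/a/b/page.html") == [
      "http://example.com/a/b/page.html",
      "http://example.com/a/b",
      "http://example.com/a",
      "http://example.com/",
  ]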
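
An academic crawler's focus on paper-like file types can reuse the same extension-check idea, this time keeping only the matching links.

  # Academic crawler filter sketch: keep links that look like papers for indexing.
  from urllib.parse import urlparse

  PAPER_EXTENSIONS = (".pdf", ".ps", ".doc", ".docx")

  def looks_like_paper(url):
      return urlparse(url).path.lower().endswith(PAPER_EXTENSIONS)

  links = ["http://example.edu/papers/crawling.pdf", "http://example.edu/index.html"]
  papers = [u for u in links if looks_like_paper(u)]
  assert papers == ["http://example.edu/papers/crawling.pdf"]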
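
A re-visit policy can be sketched with a simple staleness estimate; the model below (expected number of changes since the last visit) is an assumption made for the example, not the formula from the cited paper.

  # Re-visit policy sketch: re-crawl a page once enough changes are likely to have occurred.
  import time

  def expected_changes(last_visit, changes_per_day):
      """Estimated number of changes since the page was last crawled."""
      days_since_visit = (time.time() - last_visit) / 86400.0
      return changes_per_day * days_since_visit

  def should_revisit(last_visit, changes_per_day, threshold=1.0):
      # Re-crawl once at least `threshold` changes are expected to have happened.
      return expected_changes(last_visit, changes_per_day) >= threshold

  one_week_ago = time.time() - 7 * 86400
  assert should_revisit(one_week_ago, changes_per_day=1.0)       # changes daily: due again
  assert not should_revisit(one_week_ago, changes_per_day=0.01)  # rarely changes: not yet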

References

  1. Lawrence, Steve; Giles, C. Lee (1999-07-08). "Accessibility of information on the web". Nature. 400 (6740): 107–9. Bibcode:1999Natur.400..107L. doi:10.1038/21987. PMID 10428673.
  2. Pant, Gautam; Srinivasan, Padmini; Menczer, Filippo (2004). "Crawling the Web" (PDF). In Levene, Mark; Poulovassilis, Alexandra. Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer. pp. 153–178. ISBN 978-3-540-40676-1.
  3. Cothey, Viv (2004). "Web-crawling reliability" (PDF). Journal of the American Society for Information Science and Technology. 55 (14): 1228–1238. doi:10.1002/asi.20078.
  4. Wu, Jian; Teregowda, Pradeep; Khabsa, Madian; Carman, Stephen; Jordan, Douglas; San Pedro Wandelmer, Jose; Lu, Xin; Mitra, Prasenjit; Giles, C. Lee (2012). "Web crawler middleware for search engine digital libraries: a case study for CiteSeerX". In Proceedings of the Twelfth International Workshop on Web Information and Data Management. Maui, Hawaii, USA. pp. 57–64.
  5. https://dl.acm.org/citation.cfm?doid=342009.335391

