[Engineering] Web Data Mining Course Slides

Ch. 8: Web Crawling
By Filippo Menczer, Indiana University School of Informatics
In Web Data Mining by Bing Liu, Springer, 2007

Outline
- Motivation and taxonomy of crawlers
- Basic crawlers and implementation issues
- Universal crawlers
- Preferential (focused and topical) crawlers
- Evaluation of preferential crawlers
- Crawler ethics and conflicts
- New developments: social, collaborative, federated crawlers

Many names
- Crawler
- Spider
- Robot (or bot)
- Web agent
- Wanderer, worm, …
- And famous instances: googlebot, scooter, slurp, msnbot, …

Googlebot & you

Motivation for crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors, partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming, phishing…
- … Can you think of some others?…

A crawler within a search engine

One taxonomy of crawlers
- Many other criteria could be used: incremental, interactive, concurrent, etc.

Basic crawlers
- This is a sequential crawler
- Seeds can be any list of starting URLs
- Order of page visits is determined by the frontier data structure
- Stop criterion can be anything

Graph traversal (BFS or DFS?)
- Breadth First Search
  - Implemented with QUEUE (FIFO)
  - Finds pages along shortest paths
  - If we start with "good" pages, this keeps us close; maybe other good stuff…
- Depth First Search
  - Implemented with STACK (LIFO)
  - Wander away ("lost in cyberspace")

A basic crawler in Perl
- Queue: a FIFO list (shift and push)

    my @frontier = read_seeds($file);
    while (@frontier && $tot < $max) {
        my $next_link = shift @frontier;
        my $page = fetch($next_link);
        add_to_index($page);
        my @links = ex
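
The "Basic crawlers" and "Graph traversal" slides say that the frontier data structure alone decides the visit order: a FIFO queue gives breadth-first search, a LIFO stack gives depth-first search. The following is a minimal sketch of that idea, written in Python rather than the slides' Perl. Everything in it is an assumption for illustration: `TOY_WEB` is a hypothetical in-memory link graph standing in for real fetching and link extraction, `crawl` and its parameters are made-up names, and the `seen` set (duplicate elimination) is an addition that the slide's snippet does not show.

```python
from collections import deque

# Hypothetical toy "Web": page -> outgoing links.
# It replaces real page fetching and link extraction in this sketch.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": ["seed"],  # a cycle back to the seed
}


def crawl(seeds, max_pages, strategy="bfs", web=TOY_WEB):
    """Sequential crawler: visit up to max_pages pages, return them in visit order."""
    frontier = deque(seeds)
    seen = set(seeds)  # assumption: skip URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        # FIFO (queue) yields breadth-first order; LIFO (stack) yields depth-first.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


bfs_order = crawl(["seed"], max_pages=10, strategy="bfs")
dfs_order = crawl(["seed"], max_pages=10, strategy="dfs")
```

On this toy graph the breadth-first run visits every page one link from the seed before any page two links away, while the depth-first run follows one branch to its end before it returns to the other. The stop criterion here is a page budget (`max_pages`), one of many possible choices.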
