Information Retrieval English-Language Courseware: Lecture 2 Web Crawling.pdf

About 34,200 characters
About 79 pages
Published 2021-06-16 in Anhui


Information Retrieval Web Mining
Lecture 2: Web Crawling

Outline
• Web Crawling
• Duplicate Detection

Motivation
• The necessity of an effective paradigm for a web crawler when performing web mining on large numbers of web pages
  - To extract patterns from the web
  - To extract meaning from the link structure of the web

Web Crawler
• A web crawler, robot or spider
• A program that is capable of iteratively and automatically:
  - Downloading web pages
  - Extracting URLs from their HTML
  - Fetching them

Retrieving Web Pages
• Every page has a unique uniform resource locator (URL)
• Web pages are stored on web servers that use HTTP to exchange information with client software
• e.g.,

Retrieving Web Pages
• Web crawler client program connects to a domain name system (DNS) server
• DNS server translates the hostname into an internet protocol (IP) address
• Crawler then attempts to connect to the server host using a specific port
• After connection, crawler sends an HTTP request to the web server to request a page, usually a GET request

Crawl
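The download, extract, fetch loop and the GET request described in the slides above can be sketched in Python. This is a minimal illustration using only the standard library, not the lecture's own implementation. The names `LinkParser`, `extract_links`, `build_get_request`, `fetch_page`, and `crawl`, the breadth-first visiting order, and the `max_pages` limit are all assumptions introduced for this example. The `fetch` parameter is injectable so the loop can be exercised without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) URLs found in html, resolved against base_url."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def build_get_request(url):
    """Return the text of a minimal HTTP/1.1 GET request for url."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\nConnection: close\r\n\r\n"


def fetch_page(url):
    """Download url; urlopen performs the DNS lookup, connection, and GET."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch=fetch_page, max_pages=10):
    """Breadth-first crawl from seed_url; returns URLs in the order fetched."""
    frontier = deque([seed_url])  # URLs waiting to be downloaded
    seen = {seed_url}             # exact-match check so no URL is queued twice
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        fetched.append(url)
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched
```

Calling `crawl(seed_url)` with the default `fetch_page` would go over the network. Passing a function that looks pages up in a dictionary instead makes the loop easy to try offline. The `seen` set only catches URLs that are identical as strings, which is far simpler than the duplicate detection named in the outline.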
