Information Retrieval English Teaching Courseware: Lecture 2 Web Crawling.pdf

  • About 34,200 characters
  • About 79 pages
  • Published 2021-06-16 in Anhui


Information Retrieval Web Mining
Lecture 2: Web Crawling

Outline
  • Web Crawling
  • Duplicate Detection

Motivation
  • Web mining over large numbers of web pages requires an effective paradigm for a web crawler:
    • To extract patterns from the web
    • To extract meaning from the link structure of the web

Web Crawler
  • Also called a robot or spider
  • A program that iteratively and automatically performs the following steps:
    • Downloading web pages
    • Extracting URLs from their HTML
    • Fetching them

Retrieving Web Pages
  • Every page has a unique uniform resource locator (URL)
  • Web pages are stored on web servers that use HTTP to exchange information with client software
  • e.g.,

Retrieving Web Pages
  • The web crawler client program connects to a domain name system (DNS) server
  • The DNS server translates the hostname into an internet protocol (IP) address
  • The crawler then attempts to connect to the server host using a specific port
  • After connecting, the crawler sends an HTTP request to the web server to request a page, usually a GET request

Crawl
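The retrieval steps on the slides above (DNS lookup, connection to a port, GET request) and the download/extract/fetch loop can be sketched in Python as follows. This is only a minimal illustration, not the lecture's own code: the function names (`build_get_request`, `fetch`, `extract_links`, `crawl`) and the `max_pages` limit are assumptions made for this example, and it handles plain HTTP on port 80 only, with no HTTPS, redirects, status-code checks, or robots.txt handling.

```python
import socket
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


def build_get_request(url):
    """Return the bytes of a minimal GET request for url.

    HTTP/1.0 is used here (an assumption of this sketch) so that the
    reply normally arrives without chunked transfer encoding.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (f"GET {path} HTTP/1.0\r\n"
            f"Host: {parts.netloc}\r\n"
            "Connection: close\r\n\r\n").encode("ascii")


def fetch(url, timeout=10.0):
    """Fetch url over plain HTTP and return the response body as text."""
    parts = urlsplit(url)
    port = parts.port or 80
    # DNS step: translate the hostname into an IP address.
    ip_address = socket.gethostbyname(parts.hostname)
    # Connection step: open a TCP connection to that address and port.
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        # Request step: send the GET request, then read until the server closes.
        conn.sendall(build_get_request(url))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Drop the status line and headers; keep only the body.
    _, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    return body.decode("utf-8", errors="replace")


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and strip any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seed_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl from seed_url; returns {url: html}."""
    frontier = deque([seed_url])   # URLs waiting to be downloaded
    seen = {seed_url}              # URLs already queued, to avoid repeats
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # DNS failure, refused connection, timeout, ...
        pages[url] = html
        for link in extract_links(html, url):
            # This sketch only follows plain-HTTP links.
            if urlsplit(link).scheme == "http" and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing a different `fetch_page` callable (for example, one that reads from an in-memory dictionary) lets the loop be exercised without any network access; calling `crawl("http://example.com/")` with the default would attempt real DNS lookups and connections.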
