- About 34,200 words
- About 79 pages
- Published 2021-06-16 in Anhui
Information Retrieval Web Mining
Lecture 2: Web Crawling
Outline
Web Crawling
Duplicate Detection
Motivation
Web mining over large numbers of web pages requires an effective crawling paradigm, in order to:
Extract patterns from the web
Extract meaning from the link structure of the web
Web Crawler
A web crawler (also called a robot or spider) is a program that iteratively and automatically:
Downloads web pages
Extracts URLs from their HTML
Fetches the pages at those URLs
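
The download, extract, fetch loop above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own implementation: the names `LinkExtractor`, `extract_urls`, `crawl`, and the `max_pages` limit are assumptions made for this example, and the `fetch` argument is a caller-supplied function so that the loop itself stays independent of any network library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(base_url, html):
    """Return absolute URLs found in html, resolved against base_url."""
    parser = LinkExtractor()
    parser.feed(html)
    # urldefrag drops "#fragment" so the same page is not queued twice.
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure.

    max_pages is a hypothetical safety limit added for this sketch.
    """
    frontier = deque([seed_url])  # URLs waiting to be downloaded
    seen = {seed_url}             # URLs already queued, to avoid repeats
    pages = {}                    # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)         # download the page
        if html is None:
            continue
        pages[url] = html
        for link in extract_urls(url, html):  # extract URLs from the HTML
            if link not in seen:
                seen.add(link)
                frontier.append(link)         # schedule them for fetching
    return pages
```

Passing a dictionary lookup as `fetch`, for example `crawl("http://example.test/", site.get)` with `site` mapping URLs to HTML strings, lets the loop be exercised without touching the network.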
Retrieving Web Pages
Every page has a unique uniform resource locator (URL)
Web pages are stored on web servers that use HTTP to exchange information with client software
e.g.,
Retrieving Web Pages
The crawler's client program connects to a domain name system (DNS) server
The DNS server translates the hostname into an internet protocol (IP) address
The crawler then attempts to connect to the server host on a specific port
After connecting, the crawler sends an HTTP request to the web server to request a page, usually a GET request
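
The retrieval steps above (DNS lookup, connection to a port, GET request) can be sketched with Python's standard `socket` module. This is a rough illustration under several assumptions: it handles plain HTTP only (no HTTPS), defaults to port 80, and the function names `build_get_request` and `fetch_page` and the `User-Agent` value are invented for this example.

```python
import socket
from urllib.parse import urlsplit


def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "User-Agent: lecture-sketch-crawler",  # hypothetical agent name
        "Connection: close",  # ask the server to close when it is done
        "",
        "",  # the blank line ends the request headers
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_page(url, timeout=10.0):
    """Fetch a plain-HTTP URL and return the raw response bytes."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80  # assumed default port for plain HTTP
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    # Step 1: the DNS server translates the hostname into an IP address.
    ip_address = socket.gethostbyname(host)

    # Step 2: connect to the server host on the chosen port.
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        # Step 3: send the GET request, then read until the server closes.
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

The returned bytes contain the status line and response headers followed by the page body. In practice a crawler would more likely rely on a higher-level client such as `urllib.request`, which also handles redirects and HTTPS; the raw socket version is shown only because it mirrors the steps listed above one for one.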
Crawl