Data Mining and Search Engines, Classic PPT: chap4.ppt

  • About 6,840 characters
  • About 34 pages
  • Published 2019-03-26 in Hubei
Crawling the Web

Outline
  • Basic WWW Technologies (basic concepts of the Web)
  • Basic Crawling (basic crawling algorithms)

URI: Uniform Resource Identifier
  • URI: Uniform Resource Identifier
  • URL: Uniform Resource Locator
  • URN: Uniform Resource Name
  • Every resource available on the Web has an address that may be encoded by a URL.
  • URIs typically consist of three pieces:
      1. The naming scheme of the mechanism used to access the resource (HTTP, FTP).
      2. The name of the machine hosting the resource.
      3. The name of the resource itself, given as a path.

Relationship among URL, URN, and URI
  • URLs and URNs are subsets of URIs.
  • A URI is a simple string that identifies a resource in a uniform (standardized) way.

URI Example
  /TR
  • There is a document available via the HTTP protocol
  • Residing on the machines hosting
  • Accessible via the path /TR

Hypertext Transfer Protocol (HTTP)
  • HTTP carries WWW traffic between a browser and a server over TCP, a connection-oriented protocol.
  • TCP is one of the transport-layer protocols supported by the Internet.
  • HTTP communication is established over a TCP connection, by default to server port 80.

GET Method in HTTP

HTML Hyperlink
  <a href="relations/alumni">alumni</a>
  • A link is a connection from one Web resource to another.
  • It has two ends, called anchors, and a direction.
  • It starts at the source anchor and points to the destination anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document).

Anchor text
  • Anchor text is the hyperlinked words on a web page: the words you click on when you click a link.
  • Here's an example: reciprocal links, in which "reciprocal links" is the anchor text.
  • Anchor text mainly gives visitors a description of the content of the page the link points to.

Basic Crawling (basic crawling algorithms)

The Web is a directed graph

Completeness Observations
  • Completeness is not guaranteed.
  • Crawling assumes that any page on the Web can be reached from a single starting page. In practice this does not necessarily hold.
  • How to make it better: more seeds and more diverse seeds; a port scanner may help.

Common algorithms
  • Depth-First Search
  • Breadth-First Search

Depth-First Search
  PROCEDURE SPIDER(G, {SEEDS})
    Initialize COLLECTION big file of URL pairs  // result storage
    Initia
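
The two traversal strategies named above, and the point that completeness is not guaranteed, can be illustrated on a small in-memory directed graph. This is a minimal sketch and not the SPIDER procedure from the slides: the function name crawl, the strategy parameter, and the toy_web dictionary are all assumptions made for illustration, and the dictionary lookup stands in for fetching a page and extracting its links.

```python
from collections import deque


def crawl(graph, seeds, strategy="dfs"):
    """Return pages in visit order, starting from the seed pages.

    graph is a dict mapping each page to the list of pages it links to.
    It is a hypothetical stand-in for downloading a page and parsing links.
    """
    dfs = strategy == "dfs"
    # The frontier holds pages waiting to be visited. Seeds are reversed
    # for DFS so that the first seed ends up on top of the stack.
    frontier = deque(reversed(seeds)) if dfs else deque(seeds)
    visited = set()
    order = []
    while frontier:
        # DFS takes the newest entry (stack); BFS takes the oldest (queue).
        page = frontier.pop() if dfs else frontier.popleft()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        links = graph.get(page, [])
        # Reversing keeps DFS exploring a page's first link first.
        for link in (reversed(links) if dfs else links):
            if link not in visited:
                frontier.append(link)
    return order


# Toy Web graph (assumed for illustration): E links to A, nothing links to E.
toy_web = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
    "E": ["A"],
}

print(crawl(toy_web, ["A"], "dfs"))
print(crawl(toy_web, ["A"], "bfs"))
print(crawl(toy_web, ["A", "E"], "bfs"))
```

Starting from the single seed A, neither strategy ever reaches E, because no page links to it. Adding E as a second seed covers the whole graph, which is the reason for using more, and more diverse, seeds.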
