Posted 2016-10-14 from Chongqing

Design, Implementation, and Preliminary Application of a Scalable Web Information Gathering System

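Abstract

The work in this thesis is part of the National Key Basic Research Development Program project "Theory and Methods for Organizing and Processing Massive Information in Networked Environments". Its subject is the Web as a dynamic carrier of massive information, and its main goal is a high-performance, highly reliable system architecture that supports gathering, analyzing, and processing web pages at massive scale. The main contributions are as follows:

1) Based on an understanding of the properties of web pages and their distribution, a scalable architecture for gathering massive Web information is designed and implemented. Combining the basic requirements of Web information gathering with parallel and distributed processing on PC clusters, the architecture seeks a good trade-off among crawling strategy, scalability, communication reduction, load balancing, task scheduling, and control of parallel granularity. Building on careful theoretical analysis and extensive simulation experiments, the architecture has been implemented and put into operation. It scales well as the system grows from 1 to 18 machines and has met the performance target of gathering 57 million web pages in 15 days.

2) To address temporary failures of nodes in a parallel web-page gathering system, a scheme for dynamic system reconfiguration is proposed. The scheme rests on a two-stage mapping from page URLs to crawling nodes. When the configuration (the number of nodes) changes, this mapping guarantees that the system passes through a brief, safe transition to a new steady state, which is what makes the system dynamically configurable. The scheme has been implemented and successfully applied in the "天网" (Tianwang) search engine and in the storage system of the "燕穹" (Yanqiong) Web information museum.

3) Based on the page data held in the "燕穹" Web information museum, the content and methods of applications of massive Web information are explored. By analyzing the link structure of tens of millions of pages, the thesis gives a quantitative picture of the size, shape, and structure of the Chinese Web in early 2002, and describes a method for efficiently identifying web communities from massive page data.

Keywords: World Wide Web, search engine, scalable Web information gathering, Web information museum, dynamic configurability, load balancing, Web mining

Abstract (English)

We study the Web as a massive, rapidly evolving information resource. In particular, this thesis describes a high-performance architecture and a reliable mechanism for gathering, analyzing, and processing vast numbers of web pages. The main contributions include:

1) Based on an understanding of web pages and their distribution, a scalable architecture for gathering web pages is proposed, and a thorough study of the architecture is provided. Combining cluster-based parallel processing technology with the demanding requirements of crawling vast amounts of web information, this architecture achieves a reasonable trade-off among crawling strategy, communication reduction, load balancing, task scheduling, and granularity control. Through a process of design, simulation, and implementation, a system has been constructed and put into operation, demonstrating excellent scalability in the range of 1 to 18 processing nodes and reaching our performance goal: crawling 57 million web pages in 15 days.

2) To address the problem that nodes may occasionally fail during a long crawl, a scheme is proposed for dynamic system reconfiguration. The scheme is based on a two-phase mapping between URLs and processing nodes, which ensures that upon a change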