Undergraduate Graduation Thesis: Implementation Techniques for Text Object Extraction from Internet Web Pages.doc

  • About 21,100 characters
  • About 45 pages
  • Published 2016-12-04 from Liaoning

Implementation of Text Object Extraction for Internet Web Pages

Author: Zhang Hui
Tutor: Lin Yaping

Abstract

The Internet now holds a large amount of semi-structured information that represents objects in the real world. To cope with the challenge of the information explosion, to extract and integrate the various kinds of text object information on web pages, and to support object-level search, automated techniques are urgently needed to help people find the information they actually need. Text object extraction is one such technique. Building on traditional Information Extraction theory and targeting the blog domain, this paper proposes an algorithm that extracts the text objects of blog articles using HTML features and machine learning. The paper analyzes the features of blog pages, introduces a web page partitioning algorithm based on HTML tag features, uses a decision tree to gather statistics and train on a blog data set, tests and evaluates the algorithm with the statistical tool WEKA, and summarizes its advantages and the points that still need improvement. Finally, it presents the system architecture and user interface of Geeseek, a blog search engine that applies text object extraction to blog pages. The system belongs to the new class of vertical search engines and can search blog home pages and blog article pages quickly and effectively. As far as we know, Geeseek is the first blog search engine built at any college in China.

Key words: Internet, information explosion, Information Extraction, blog, HTML, machine learning, Search Engine, decision tree, Gee
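To make the idea of partitioning a page by HTML tag features more concrete, the following is a minimal Python sketch and not the thesis's actual algorithm. The tag set `BLOCK_TAGS`, the names `BlockPartitioner` and `partition`, and the two features (`text_len`, `link_density`) are all assumptions chosen for illustration; the abstract does not say which tags or features the author used. Features of this kind are the sort of per-block input on which a decision tree could be trained to separate article text from navigation.

```python
from html.parser import HTMLParser

# Hypothetical set of tags treated as block boundaries (not taken from the thesis).
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3", "table", "ul"}


class BlockPartitioner(HTMLParser):
    """Splits a page into text blocks at block-level tags and records simple features."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks, in document order
        self._text = []         # text chunks of the block being built
        self._link_chars = 0    # characters of the current block inside <a> tags
        self._in_link = False

    def _flush(self):
        # Close the current block, if it contains any text, and start a new one.
        text = " ".join(self._text)
        if text:
            self.blocks.append({
                "text": text,
                "text_len": len(text),
                "link_density": self._link_chars / len(text),
            })
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        chunk = data.strip()
        if chunk:
            self._text.append(chunk)
            if self._in_link:
                self._link_chars += len(chunk)


def partition(html):
    """Return a list of feature dictionaries, one per non-empty text block."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text that follows the last block tag
    return parser.blocks
```

Calling `partition(page_source)` on a blog page would typically give navigation blocks a high `link_density` and article bodies a low one, which is why such a feature is a plausible candidate for a learned classifier; any thresholds would have to come from training data rather than from this sketch.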
