- Published 2016-12-04, Liaoning
Implementation of Text Object Extraction for Internet Web Pages
Author: Zhang Hui
Tutor: Lin Yaping
Abstract
The Internet now holds a large volume of semi-structured information describing real-world objects. Coping with the information explosion, extracting and integrating the text object information on web pages, and supporting object-level search all call for automated technologies that help people find the information they actually need. Text object extraction is one method for addressing this problem.
Building on traditional Information Extraction theory and targeting the blog domain, this paper proposes an algorithm that extracts the text objects of blog articles by combining HTML features with machine learning. The paper analyzes the features of blog pages, introduces a web page partitioning algorithm based on HTML tag features, trains a decision tree on a blog data set, tests and evaluates the algorithm with the statistical tool WEKA, and summarizes its advantages and the points that need improvement. Finally, it presents the system architecture and interface of Geeseek, a blog search engine that applies text object extraction to blog pages. The system belongs to the new class of vertical search engines and can search blog home pages and blog article pages quickly and effectively. As far as we know, Geeseek is the first blog search engine among colleges in China.
Key words: Internet, information explosion, Information Extraction, blog, HTML, machine learning, Search Engine, decision tree, Gee
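
The abstract does not give the details of the page partitioning step, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It splits a page into text blocks at block-level HTML tags and computes two simple per-block features. The tag set `BLOCK_TAGS`, the class `BlockPartitioner`, the function `partition`, and the features `length` and `link_ratio` are all hypothetical choices made for illustration; features of this kind could then be passed to a decision tree learner.

```python
from html.parser import HTMLParser

# Hypothetical set of block-level tags that delimit page blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3", "table", "ul"}


class BlockPartitioner(HTMLParser):
    """Split a page into text blocks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks, one feature dict each
        self._text = []        # text fragments of the current block
        self._link_chars = 0   # characters of anchor text in the block
        self._in_link = 0      # nesting depth of <a> elements
        self._skip = 0         # nesting depth of <script>/<style>

    def _flush(self):
        # Close the current block and record its features if it has text.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append({
                "text": text,
                "length": len(text),
                "link_ratio": min(1.0, self._link_chars / len(text)),
            })
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def partition(html):
    """Return a list of feature dicts, one per non-empty text block."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    parser._flush()  # record any trailing text outside a block tag
    return parser.blocks
```

In this sketch a navigation bar tends to produce a block with a high `link_ratio`, while an article body produces a longer block with a low one, which is the kind of contrast a decision tree can learn to separate.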