Analysis and Application of Web Page Information Extraction Algorithm Based on Block
- About 56,500 characters
- About 67 pages
- Published 2018-05-18 in Shanghai
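Abstract

With the rapid growth of the Web, obtaining the information one needs from websites has become an urgent problem. Web information extraction is therefore necessary and has become an active research topic. The problem it must solve is how to extract useful information accurately from complex page structures while keeping human involvement to a minimum. To address this, block-based Web information extraction has emerged: the Web page is first divided into a number of mutually independent semantic blocks, and then, depending on the application, the blocks with the relevant semantic features are selected for extraction. This approach not only reduces the complexity of the extraction problem but also substantially improves extraction precision.

This paper analyzes the current page segmentation algorithms, including HTML tag analysis, the VIPS segmentation algorithm, and DOM-tree segmentation, and focuses on statistics-based Web page segmentation and its application to Web information extraction. First, it proposes MDSPS, a Web page segmentation algorithm based on statistics of the HTML tag distribution, describes its basic principle and implementation in detail, and compares it with two classic segmentation algorithms: the HTML block-parsing method and VIPS. Second, it proposes an algorithm for obtaining the block hierarchy, which derives the hierarchical block structure of a Web page from the MDSPS segmentation result. It also gives a method for analyzing block semantic features that extracts them simply and effectively. Using this method on top of the block hierarchy, the specific semantic blocks needed by a given Web application can be selected quickly and accurately from a large number of blocks, which improves the accuracy of Web information extraction. Finally, it describes the application of these segmentation algorithms to information extraction and web page classification.

Keywords: segmentation algorithm, tag statistics, hierarchical structure, semantic feature analysis, information extraction

Abstract (original English version)

With the rapid development of the Web, how to get the information you need has become an urgent problem. Web information extraction is therefore necessary, and it is also regarded as a research hotspot at present. The problem to be solved is how to keep information extraction from being affected by differences and changes in page structure, and how to reduce human participation as far as possible. Aiming at this problem, a new technology of web information extraction based on blocks has appeared. It splits the web page into independent semantic blocks. Then, according to the application, it chooses the blocks with the relevant semantic characteristics to extract information. The new method not only reduces the complexity of the problem efficiently, but also improves precision. The research focus of the paper is how to design and implement a page segmentation algorithm that is exact, automatic, intelligent, efficient and simple. Firstly, the paper proposes a page segmentation algorithm based on HTML mark distribution statistics, describes its principle and implementation process, and compares it with two classic page segmentation algorithms, the HTML segmentation analysis algorithm and VIPS. Secondly, the paper proposes an algorithm for obtaining the block structure, which derives the block structure from the result of MDSPS. Thirdly, it proposes an analysis method for the semantic characteristics of blocks. Using this method, specific blocks can be selected from many semantic blocks quickly and accurately. Finally, it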
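The abstract describes the general two-step idea (segment the page into blocks, then pick blocks by their features) without giving the details of MDSPS. The following is therefore not MDSPS, only a minimal sketch of that general idea under stated assumptions: the set of block-delimiting tags, the per-block tag counts, and the link-ratio selection rule are all illustrative choices, and the names `BlockSplitter`, `split_blocks`, `link_ratio`, and `pick_content_block` are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not the paper's rule).
BLOCK_TAGS = {"div", "table", "ul", "ol", "section", "article", "nav"}


class BlockSplitter(HTMLParser):
    """Splits a page into blocks and records simple tag statistics per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished blocks, in the order their end tags are seen
        self.stack = []   # blocks that are currently open (innermost last)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            # Open a new block; statistics go to the innermost open block only.
            self.stack.append({"tag": tag, "tags": {}, "text_len": 0})
        elif self.stack:
            counts = self.stack[-1]["tags"]
            counts[tag] = counts.get(tag, 0) + 1

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text_len"] += len(data.strip())

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())


def split_blocks(html):
    """Returns the list of blocks found in an HTML string."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def link_ratio(block):
    """Links per character of text; navigation-like blocks score high."""
    return block["tags"].get("a", 0) / max(block["text_len"], 1)


def pick_content_block(blocks):
    """Picks the non-empty block with the lowest link ratio (illustrative rule)."""
    candidates = [b for b in blocks if b["text_len"] > 0]
    return min(candidates, key=link_ratio)


page = (
    '<div><ul><li><a href="/">Home</a></li>'
    '<li><a href="/news">News</a></li></ul></div>'
    "<div><p>Block-based extraction segments the page first.</p></div>"
)
blocks = split_blocks(page)
best = pick_content_block(blocks)
```

Here the link-heavy list is treated as navigation and the paragraph block is selected. A real segmentation algorithm would use richer statistics and keep the nesting between blocks rather than flattening them.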