Analysis and Application of Web Page Information Extraction Algorithm Based on Block

About 56,500 characters
About 67 pages
Published 2018-05-18 in Shanghai


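Abstract

With the rapid growth of the Web, obtaining the information one needs from websites has become a pressing problem. This makes Web information extraction necessary and has turned it into an active research topic. The problem it must solve is how to extract useful information accurately from complex page structures while keeping human involvement to a minimum. To address this, block-based Web information extraction has emerged: a Web page is first divided into a number of mutually independent semantic blocks, and then, depending on the application, the blocks with the relevant semantic features are selected for extraction. This approach not only reduces the complexity of the extraction problem effectively but also greatly improves extraction precision.

This thesis analyzes existing Web page segmentation algorithms, including HTML tag analysis, the VIPS segmentation algorithm, and DOM-tree segmentation, and focuses on statistics-based Web page segmentation and its application to Web information extraction. First, it proposes MDSPS, a Web page segmentation algorithm based on the statistical distribution of HTML tags, describes its basic principle and implementation in detail, and compares it with two classic page segmentation algorithms: the HTML block parsing method and VIPS. Second, it proposes a block hierarchy acquisition algorithm that derives the block hierarchy of a Web page from the MDSPS segmentation result. It also gives a block semantic feature analysis method that extracts the semantic features of blocks simply and effectively. Using this method on top of the block hierarchy, the specific semantic blocks that a given Web application needs can be selected quickly and accurately from a large number of semantic blocks, which improves the accuracy of Web information extraction. Finally, it introduces applications of the above segmentation algorithms to information extraction and Web page classification.

Keywords: segmentation algorithm, tag statistics, hierarchical structure, semantic feature analysis, information extraction

Abstract (original English version)

With the rapid development of the Web, how to get the information you need has become a problem to be solved. Web information extraction is therefore necessary, and it is regarded as a research hotspot at present. The problem to be resolved is how to keep information extraction from being affected by differences and changes in page structure, and to reduce human participation as far as possible. Aiming at this problem, a new technology of Web information extraction based on blocks has appeared. It splits the Web page into independent semantic blocks. Then, according to the application, it chooses the blocks that have the relevant semantic characteristics to extract information. The new method not only reduces the complexity of the problem efficiently, but also improves the precision. The research focus of the paper is how to design and implement a page segmentation algorithm that is exact, automatic, intelligent, efficient and simple. Firstly, the paper proposes a page segmentation algorithm based on HTML mark distribution statistics, describes the principle and the process of implementation, and compares it with two classic page segmentation algorithms, the HTML segmentation analysis algorithm and VIPS. Secondly, the paper proposes an algorithm for getting the block structure, which obtains the block structure from the result of MDSPS. Thirdly, it proposes an analysis method for the semantic characteristics of blocks. Adopting the method, it is able to select the specific ones from lots of semantic blocks fast and accurately. Finally, it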
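The two-step idea described in the abstract (segment the page into blocks, then select blocks by their features) can be illustrated with a minimal Python sketch. This is not MDSPS, whose details the abstract does not give. It is a deliberately simple stand-in that cuts a page at a hypothetical set of block-level tags (`BLOCK_TAGS`), counts the tags seen in each block, and uses link density as an assumed "semantic feature" for telling navigation blocks from content blocks. The names `BlockSplitter`, `segment`, `pick_content_blocks`, and the 0.5 threshold are all illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries; the thesis does not list its own set.
BLOCK_TAGS = {"div", "table", "ul", "ol", "p", "section", "article"}


class BlockSplitter(HTMLParser):
    """Cut a page into flat text blocks at block-level tags and count tags per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_link = False
        self._reset()

    def _reset(self):
        self._text = []          # text pieces collected since the last boundary
        self._link_chars = 0     # characters of that text that sit inside <a>
        self._tags = Counter()   # tags opened since the last boundary

    def _flush(self):
        # Emit a block only if some text was collected, then start a new one.
        total_chars = sum(len(piece) for piece in self._text)
        if total_chars:
            self.blocks.append({
                "text": " ".join(self._text),
                "tags": dict(self._tags),
                "link_density": self._link_chars / total_chars,
            })
        self._reset()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        self._tags[tag] += 1
        if tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        piece = data.strip()
        if piece:
            self._text.append(piece)
            if self._in_link:
                self._link_chars += len(piece)


def segment(html):
    """Return the list of blocks found in an HTML string."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter._flush()  # emit any trailing text after the last boundary
    return splitter.blocks


def pick_content_blocks(blocks, max_link_density=0.5):
    """Keep blocks whose text is mostly not link text (assumed content blocks)."""
    return [b for b in blocks if b["link_density"] < max_link_density]


sample = """
<div><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></div>
<div><p>Block-based extraction first segments the page, then picks blocks.</p></div>
"""
blocks = segment(sample)
content = pick_content_blocks(blocks)
for block in content:
    print(block["text"])
```

In this sample the navigation list consists entirely of link text and is filtered out, while the paragraph block is kept. A real system would replace both the boundary rule and the feature with the statistics the thesis actually defines.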
