- About 42,300 characters
- About 56 pages
- Published 2018-07-31 in Shanghai
Analysis of Filling Strategy for Query Entry in Deep Web
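Abstract (translated from the Chinese)

Most of what search engines currently index is surface web information; for technical reasons, they can index almost nothing in the deep web. Yet the deep web offers large capacity, high quality, and strong specialization, so its significance cannot be ignored. A way to crawl the deep web is therefore needed, which makes it worthwhile to build a deep web crawler that retrieves deep web data, and automatic form filling is an important component of such a crawler.

This thesis first introduces the value of the deep web and the reasons it is hard to search, analyzes and compares the state of research in China and abroad, and introduces HTML forms, the Document Object Model (DOM), extraction methods, ontology knowledge, and similarity computation methods. On this basis, it proposes a strategy for filling deep web entry forms. First, improved heuristic rules identify deep web query entry forms. Next, the nearest-principle algorithm proposed in this thesis extracts the form labels. The extracted labels are standardized before the final matching and filling. Finally, an improved semantics-based similarity matching algorithm matches the form labels against the attributes of the ontology domain knowledge base, which makes it possible to simulate a user filling in a deep web entry form.

The thesis concludes with an experimental validation of the whole algorithm, using deep web entry forms from the book domain. For query entry identification, the heuristic rules summarized in this thesis reach an accuracy of 90.76%. For form extraction, the nearest-principle algorithm extracts form labels with an accuracy of 94.23%. The improved semantic similarity algorithm is then used to find the attribute that matches each form label, and the value of the matched attribute is used to fill the form control. The matching success rate reaches 88.83% and the filling success rate reaches 95.43%, which indicates that the proposed strategy for filling deep web entry forms is effective.

Keywords: deep web, query entry, form filling

Abstract (original English)

At present, limited by some technical reasons, general search engines can only index the information on the surface web instead of the deep web. However, the deep web has great advantages, such as large capacity, high quality, and professional character. Thus, its importance and influence should not be ignored, and it is rather necessary to search for an approach to crawl the deep web. Therefore, it is greatly significant to construct a deep web crawler, of which automatic form filling is an essential part, to gain the data on the deep web.

This thesis first introduces the value of the deep web and the reason why searching the deep web is difficult, and analyzes and compares the state of study at home and abroad. It also introduces the HTML form, the Document Object Model (DOM), ontology knowledge, and extraction methods. On this basis, the author proposes a strategy for filling a query entrance of the deep web. Firstly, the author uses heuristic rules to identify those forms in the deep web. Secondly, with the nearest-principle algorithm, the author extracts the labels of the form. Before the forms are filled, the labels are standardized. At last, employing the algorithm based on improved ontology similarity matching, the author matches the labels of the form with the attributes of the semantic domain warehouse. In this way, we can simulate the process of a user filling the forms of the deep web.

At the end of the paper, the proposed algorithm is verified through experiment. Websites from the library domain are used. The first step is to identify those query entrance of form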
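The pipeline summarized in the abstract (extract labels, standardize them, match them against domain attributes, fill the controls) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration, not the thesis's implementation: the nearest principle is approximated as "the closest preceding text node", the semantic similarity measure is replaced by plain string similarity from `difflib`, and the names `NearestLabelParser`, `normalize`, `best_attribute`, `fill_form`, the sample form, and the knowledge base values are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class NearestLabelParser(HTMLParser):
    """Pair each form control with the closest preceding text node.

    Assumption: this is only one simple reading of a "nearest principle".
    """

    def __init__(self):
        super().__init__()
        self.last_text = ""
        self.labels = {}  # control name -> raw label text

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.last_text = text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        skipped = ("submit", "hidden", "button")
        if tag in ("input", "select", "textarea") and attrs.get("type") not in skipped:
            name = attrs.get("name")
            if name:
                self.labels[name] = self.last_text


def normalize(label):
    """Standardize a raw label: lowercase, drop surrounding punctuation."""
    return label.lower().strip(" :*\t")


def best_attribute(label, knowledge, threshold=0.6):
    """Return the best-matching attribute name, or None below the threshold.

    String similarity stands in here for a semantic similarity measure.
    """
    best, best_score = None, 0.0
    for attr in knowledge:
        score = SequenceMatcher(None, normalize(label), attr).ratio()
        if score > best_score:
            best, best_score = attr, score
    return best if best_score >= threshold else None


def fill_form(html, knowledge):
    """Map each control name to the value of its matched attribute."""
    parser = NearestLabelParser()
    parser.feed(html)
    filled = {}
    for name, label in parser.labels.items():
        attr = best_attribute(label, knowledge)
        if attr is not None:
            filled[name] = knowledge[attr]
    return filled


# Hypothetical book-domain knowledge base and query form.
BOOK_KNOWLEDGE = {
    "title": "Deep Web Crawling",
    "author": "Smith",
    "publisher": "Example Press",
}
FORM = """
<form action="/search">
  Title: <input type="text" name="q1">
  Author: <input type="text" name="q2">
  <input type="submit" value="Search">
</form>
"""

if __name__ == "__main__":
    print(fill_form(FORM, BOOK_KNOWLEDGE))
```

A threshold is used so that labels with no plausible attribute are left unfilled rather than matched to the least bad candidate; the value 0.6 is arbitrary and would need tuning against real forms.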