Research on Text Similarity Computation (Graduation Thesis/Design)

Published 2018-12-03, Guangxi

Abstract

In Chinese information processing, text similarity computation is widely used in information retrieval, machine translation, automatic question answering, text mining, and plagiarism detection. Chinese word segmentation, one step of that computation, plays a crucial role in search engines and natural language processing and has long been both a focus and a difficulty of research.

For Chinese text similarity, word segmentation is the foundation and prerequisite: an effective segmentation algorithm can greatly improve the accuracy of the similarity result. The key problems in segmentation are disambiguation and the recognition of out-of-vocabulary words. This thesis uses a part-of-speech transition probability table for disambiguation and the shortest path through a directed topological graph for segmentation, and obtains fairly good results.

Similarity is computed by combining word frequency with word order. Word-frequency similarity is computed with TF-IDF features and maximum bipartite matching, but this measure returns the same value when the words of a sentence are reordered, so an algorithm that distinguishes word order is also required. The state transition matrix of a Markov model gives the probability of moving from one word to another (each word is treated as one state of the Markov model). The final similarity is produced by an algorithm that combines the longest common subsequence, the Markov state transition matrix, and TF-IDF.

The lexicon is built from the Modern Chinese Dictionary (现代汉语词典) and the text-format word list supplied with the Ziguang (紫光) input method, converted into an indexed lexicon in a format specific to this project, which greatly improves segmentation efficiency. Part-of-speech tagging uses the 1998 People's Daily part-of-speech annotations. Tests on news text from Sina, Sohu, People's Daily Online (人民网), Xinhuanet (新华网), and other major news sites gave good results: the system fairly accurately reports the number of matching terms between two text files and their similarity, and highlights the matching passages.

Keywords: text similarity; Markov model; vector space model; Chinese word segmentation; feature vector method
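The word-order problem described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses raw term frequencies instead of TF-IDF weights, compares observed word-to-word transitions instead of a full probability matrix, and the function names and the weights in `combined_similarity` are hypothetical choices made only for this illustration.

```python
import math
from collections import Counter


def cosine_tf(a, b):
    """Cosine similarity of raw term-frequency vectors (ignores word order)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) *
                     sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_ratio(a, b):
    """LCS length normalised by the longer sequence (order-sensitive)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def transition_overlap(a, b):
    """Share of word-to-word transitions (each word is a state) found in both."""
    ta, tb = Counter(zip(a, a[1:])), Counter(zip(b, b[1:]))
    shared = sum((ta & tb).values())  # multiset intersection
    total = max(sum(ta.values()), sum(tb.values()))
    return shared / total if total else 1.0


def combined_similarity(a, b, weights=(0.5, 0.25, 0.25)):
    """Weighted blend of the three scores; the weights are an assumption."""
    w_tf, w_lcs, w_tr = weights
    return (w_tf * cosine_tf(a, b)
            + w_lcs * lcs_ratio(a, b)
            + w_tr * transition_overlap(a, b))


if __name__ == "__main__":
    s1 = ["dog", "bites", "man"]
    s2 = ["man", "bites", "dog"]
    print(cosine_tf(s1, s2), lcs_ratio(s1, s2), transition_overlap(s1, s2))
    print(combined_similarity(s1, s2))
```

For the two reordered sentences the term-frequency cosine is 1.0 even though the meaning differs, while the LCS ratio and the transition overlap both drop, which is why the blended score falls below 1.0.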
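Shortest-path segmentation, as mentioned in the abstract, can be sketched as follows. This is an illustrative simplification rather than the thesis's algorithm: every edge has weight 1, so the shortest path is the segmentation with the fewest words, single characters are always allowed as a fallback, and the function name `segment_shortest_path` and the small lexicon are hypothetical.

```python
def segment_shortest_path(text, lexicon):
    """Split text into the fewest words, using lexicon entries or single characters.

    Positions 0..len(text) are the nodes of a directed acyclic graph; an edge
    i -> j exists when text[i:j] is a lexicon word or a single character.
    """
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=1)
    best = [0] + [n + 1] * n   # best[j]: fewest words covering text[:j]
    back = [0] * (n + 1)       # back[j]: start index of the last word on that path
    for i in range(n):         # nodes are visited in topological order
        for j in range(i + 1, min(n, i + max_len) + 1):
            if j == i + 1 or text[i:j] in lexicon:
                if best[i] + 1 < best[j]:
                    best[j] = best[i] + 1
                    back[j] = i
    words, j = [], n
    while j > 0:               # walk the back-pointers from the end
        i = back[j]
        words.append(text[i:j])
        j = i
    return words[::-1]


if __name__ == "__main__":
    lexicon = {"文本", "相似", "相似度", "计算"}
    print(segment_shortest_path("文本相似度计算", lexicon))
```

With this lexicon the call returns `['文本', '相似度', '计算']`, because the three-word path is shorter than the four-word path through `相似` and `度`. When two paths have the same length, this sketch keeps whichever it finds first; resolving such ties is where a disambiguation step, such as the part-of-speech transition probabilities the abstract describes, would be needed.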