Text similarity measure method based on classified dictionary

LI Hailin, ZOU Jinchuan
Department of Information Systems, Huaqiao University; Research Center of Applied Statistics and Big Data, Huaqiao University

Abstract: Existing text similarity measures based on semantic knowledge rules suffer from high time complexity. This paper proposes a text similarity measure based on a classified dictionary. Texts are first segmented with the Chinese lexical analysis system ICTCLAS, and keywords are extracted with the TF×IDF (term frequency-inverse document frequency) method. Each keyword is then assigned a code by traversing the classified dictionary, and the similarity between two original texts is measured by how close their keyword codes are. Similarity measures from two categories, those based on semantic knowledge rules and those based on statistics, serve as comparison methods, and each measure is evaluated with traditional clustering algorithms and k-nearest neighbors (KNN) classification. Numerical experiments show that the proposed method achieves relatively good results in both clustering and classification, and that it is more time-efficient than other similarity measures based on semantic analysis.

Keywords: text mining; semantic analysis; classified dictionary; keyword extraction; word coding; similarity measurement; clustering; classification

About the authors: LI Hailin, male, born in 1982, associate professor, Ph.D. His main research interests are data mining and decision support. He has led one National Natural Science Foundation of China project and two provincial or ministerial projects, and has published more than 40 academic papers, of which 11 are indexed by SCI and more than 20 by EI. ZOU Jinchuan, female, born in 1993, master's student. Her main research interest is text mining. E-mail: Zou_jinchuan@163.com.

Received: 2016-08-30
Funding: National Natural Science Foundation of China project
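The pipeline summarized in the abstract (keyword extraction by TF×IDF, dictionary lookup of keyword codes, comparison of codes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the documents are assumed to be already segmented (the paper uses ICTCLAS for that step), `TOY_DICTIONARY` and its codes are made up, and both the prefix-based `code_similarity` and the averaged best-match rule in `text_similarity` are placeholder choices, since the abstract does not give the actual formulas.

```python
import math
from collections import Counter

# Hypothetical toy dictionary: word -> hierarchical category code.
# Words in the same category share a longer code prefix.
TOY_DICTIONARY = {
    "apple": "Bh07A01",
    "pear": "Bh07A02",
    "fruit": "Bh07B01",
    "car": "Bo21A01",
    "truck": "Bo21A02",
    "road": "Cb12A01",
}


def tfidf_keywords(docs, top_k=2):
    """Return the top_k tokens of each pre-segmented document, ranked by TF x IDF."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    keywords = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        keywords.append(ranked[:top_k])
    return keywords


def code_similarity(code_a, code_b):
    """Assumed scoring: length of the shared prefix over the longer code length."""
    shared = 0
    for ch_a, ch_b in zip(code_a, code_b):
        if ch_a != ch_b:
            break
        shared += 1
    return shared / max(len(code_a), len(code_b))


def text_similarity(keywords_a, keywords_b, dictionary):
    """Assumed aggregation: symmetric average of each keyword's best code match."""
    codes_a = [dictionary[w] for w in keywords_a if w in dictionary]
    codes_b = [dictionary[w] for w in keywords_b if w in dictionary]
    if not codes_a or not codes_b:
        return 0.0

    def directed(src, dst):
        return sum(max(code_similarity(a, b) for b in dst) for a in src) / len(src)

    return 0.5 * (directed(codes_a, codes_b) + directed(codes_b, codes_a))


# Three tiny pre-segmented "documents": two about fruit, one about vehicles.
DEMO_DOCS = [
    ["apple", "pear", "apple", "road"],
    ["pear", "fruit", "fruit", "road"],
    ["car", "truck", "car", "road"],
]
DEMO_KEYWORDS = tfidf_keywords(DEMO_DOCS, top_k=2)
SIM_FRUIT_FRUIT = text_similarity(DEMO_KEYWORDS[0], DEMO_KEYWORDS[1], TOY_DICTIONARY)
SIM_FRUIT_CAR = text_similarity(DEMO_KEYWORDS[0], DEMO_KEYWORDS[2], TOY_DICTIONARY)
```

In this toy setup the two fruit documents score higher than the fruit and vehicle pair, because their keyword codes share longer prefixes. Since only a handful of keyword codes are compared per document pair, a scheme of this kind avoids rule-based semantic analysis of the full text, which is the efficiency argument the abstract makes.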