Semi-Supervised Learning Thesis: Application of a k-means Algorithm Based on Labeled Samples and Similarity Adjustment to Text Clustering

Published 2017-05-02 in Sichuan




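[Chinese abstract] In many practical applications of machine learning, obtaining labeled samples is usually costly, and in some cases obtaining all class labels is very difficult. In recent years, semi-supervised learning has become a research hotspot in machine learning. Semi-supervised learning uses both labeled and unlabeled samples to guide the learning process and thereby achieves better learning performance. Research on semi-supervised learning falls roughly into two categories: semi-supervised classification and semi-supervised clustering. Semi-supervised clustering builds on unsupervised clustering: it uses a small number of labeled samples to guide the clustering of the unlabeled samples.

This thesis studies clustering techniques and semi-supervised learning, and introduces text data preprocessing, distance formulas, evaluation criteria for clustering algorithms, and several extensions of the k-means algorithm. Randomly selected labeled samples serve as the supervision information. They are converted into a Must-link constraint set and a Cannot-link constraint set, which are used to reconstruct the similarity matrix of the sample set and so redefine which samples count as similar or dissimilar. The k-means++ algorithm provides an effective method for selecting cluster seeds; it reduces the algorithm's sensitivity to the choice of seeds, and its clustering accuracy is markedly better than that of traditional random seed selection. This thesis incorporates the influence of the labeled samples into the initial centroid selection of k-means++, proposes a k-means algorithm based on labeled samples and similarity adjustment, and tests it on two datasets, 20-newsgroup and Spam. The experimental results show that the proposed algorithm outperforms the Seeded k-means and k-means++ algorithms in both clustering accuracy and execution efficiency.

[English abstract] In many application fields of machine learning, obtaining data labels is usually costly. In some cases, it is very difficult to obtain all of the class labels. In recent years, semi-supervised learning has become a research focus in the machine learning field. Semi-supervised learning takes advantage of both labeled and unlabeled samples to guide the learning process, leading to better learning performance. Research on semi-supervised learning can be divided into two categories, namely semi-supervised classification and semi-supervised clustering. Semi-supervised clustering uses a small number of labeled samples to guide the clustering of unlabeled samples. We studied clustering techniques and semi-supervised learning, and introduce text data preprocessing, distance metrics, the evaluation of clustering algorithms, and constraint-based k-means clustering algorithms. The supervision information consists of labeled samples selected randomly from the collection; these labels are converted into a Must-link constraint set and a Cannot-link constraint set, which are used to reconstruct the similarity matrix of the collection and re-establish the standard of similarity or dissimilarity among samples. The k-means++ algorithm provides an effective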