

Computer Knowledge and Technology (电脑知识与技术), Vol. 8, No. 7, March 2012
ISSN 1009-3044
E-mail: xsjl@
Tel: +86-551-5690963, 5690964

An Algorithm for Text Classification Based on KNN

YU Yue-meng, HUANG Xiao-bin (余悦蒙, 黄小斌)
(School of Information Science and Engineering, Xiamen University, Xiamen 361005, China)

Abstract: KNN (K-Nearest Neighbor) is one of the best text classification algorithms in the vector space model. However, when the sample set is large and the text vectors have many dimensions, its classification efficiency and accuracy drop sharply. This paper proposes an improved algorithm that raises the classification efficiency of KNN, together with an improved similarity computation that judges high-dimensional text vectors over large sample sets more accurately. During training, the algorithm computes the distribution range of each class of texts in the vector space. During classification, it narrows the K-nearest-neighbor search range according to where the vector of the text to be classified lies in the sample space. Experiments confirm that the improved algorithm significantly increases classification efficiency while keeping the classification performance of KNN essentially unchanged.

Keywords: text classification; K-nearest neighbor (KNN); algorithm

CLC number: TP301    Document code: A    Article ID: 1009-3044(2012)07-1564-03

The rapid development of the Internet has brought humanity into the information age. The arrival of the information age has also caused the worldwide volume of information to grow rapidly
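
The range-based pruning described in the abstract can be sketched roughly as follows. This excerpt does not give the paper's actual definition of a class's distribution range or its improved similarity formula, so everything concrete here is an assumption made for illustration: the range is modeled as a class centroid plus the lowest cosine similarity between that centroid and any training vector of the class, plain cosine similarity is used throughout, and the names `train`, `classify`, and `slack` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def train(samples):
    """Map each label to (centroid, min_sim).

    ASSUMPTION: min_sim, the lowest similarity between the centroid and a
    training vector of the class, stands in for the paper's "distribution
    range", which is not specified in this excerpt.
    """
    by_class = defaultdict(list)
    for vec, label in samples:
        by_class[label].append(vec)
    regions = {}
    for label, vecs in by_class.items():
        dim = len(vecs[0])
        centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        min_sim = min(cosine(centroid, v) for v in vecs)
        regions[label] = (centroid, min_sim)
    return regions


def classify(query, samples, regions, k=3, slack=0.05):
    """KNN majority vote restricted to classes whose region may hold the query."""
    candidates = {
        label
        for label, (centroid, min_sim) in regions.items()
        if cosine(query, centroid) >= min_sim - slack
    }
    if not candidates:
        # The query lies outside every region: fall back to a full search.
        candidates = set(regions)
    scored = [
        (cosine(query, vec), label)
        for vec, label in samples
        if label in candidates
    ]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]


# Toy term-weight vectors; the labels are made up for the example.
samples = [
    ([1.0, 0.1, 0.0], "sports"),
    ([0.9, 0.2, 0.0], "sports"),
    ([0.0, 0.1, 1.0], "finance"),
    ([0.1, 0.0, 0.9], "finance"),
]
regions = train(samples)
print(classify([0.8, 0.1, 0.1], samples, regions))
```

In this sketch the saving comes from skipping the similarity computation for every training vector of a pruned class. The `slack` margin trades speed for safety: a larger value prunes fewer classes and behaves more like ordinary KNN.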
