Research on Progressive Chinese Text Classification Techniques: Thesis in Computer Applications

About 30,800 characters
About 39 pages
Published 2019-02-19, Shanghai

Abstract

K-nearest neighbor (K-NN) is a statistics-based classification method and one of the more commonly used classification algorithms in data mining. Its basic idea is as follows: given a document to be classified, the system finds its K nearest neighbors in the training set, determines which class the majority of those K neighbors belong to, and assigns the document to that class.

The K-NN classifier is a lazy learning method: it does not actually build a classifier from the training samples. Instead, it first stores all the training samples and carries out the computation only when a classification is requested. Consequently, when the number of training or test samples grows rapidly, the computational cost of K-NN grows rapidly as well, and classification is much slower than with eager learning methods. In terms of classification accuracy, however, lazy learning has a considerable advantage.

This thesis exploits the high accuracy of the nearest-neighbor approach and, to address its slow classification speed, proposes a progressive approach to text classification. A text is classified progressively using its title, keywords, key paragraphs, and finally its full text. If a text can be classified successfully without using the full text, classification is much faster, which achieves the goal of improving classification efficiency. Experimental data show that the method achieves high classification efficiency and accuracy.

Keywords: K-nearest neighbor; progressive approach; text classification

Chapter 1 Introduction

1.1 Research Status at Home and Abroad

Data mining (DM), simply put, is the mining or extraction of knowledge from large amounts of data; it is also known as knowledge discovery in databases (Knowledg
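
The progressive scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. The following are all assumptions made for illustration: the bag-of-words cosine similarity, the whitespace tokenizer (Chinese text would need word segmentation), the rule that a stage "succeeds" when the winning class holds at least `min_share` of the K votes, the comparison of each field against the same field of the training documents, and the helper names `make_doc`, `knn_vote`, and `classify_progressively`.

```python
import math
from collections import Counter

# Fields in the order the abstract lists them, cheapest evidence first.
STAGES = ["title", "keywords", "key_paragraphs", "full_text"]

def make_doc(title, keywords, body, label=None):
    # Hypothetical helper: the first sentence of the body stands in for the key paragraphs.
    return {"title": title, "keywords": keywords,
            "key_paragraphs": body.split(".")[0], "full_text": body, "label": label}

def vectorize(text):
    # Bag-of-words counts; whitespace splitting is a placeholder for Chinese word segmentation.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def knn_vote(query, training, k):
    """Majority vote among the k most similar training vectors.

    Returns (label, share of the k votes won); neighbors with zero similarity do not vote.
    """
    scored = [(cosine(query, vec), label) for vec, label in training]
    top = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)[:k]
    if not top:
        return None, 0.0
    label, count = Counter(lbl for _, lbl in top).most_common(1)[0]
    return label, count / k

def classify_progressively(doc, training_docs, k=3, min_share=0.6):
    """Try each stage in turn and stop at the first one whose vote is decisive.

    The acceptance rule (vote share >= min_share) is an assumed success criterion.
    Returns (label, stage at which the decision was made).
    """
    label = None
    for stage in STAGES:
        training = [(vectorize(d[stage]), d["label"]) for d in training_docs]
        label, share = knn_vote(vectorize(doc[stage]), training, k)
        if label is not None and share >= min_share:
            return label, stage
    # No stage was decisive: fall back to the full-text vote (may be None).
    return label, STAGES[-1]

# Toy training set, invented purely for illustration.
TRAINING = [
    make_doc("football match result", "football league goal",
             "The team won the football match. A late goal decided the league.", "sports"),
    make_doc("basketball league final", "basketball league playoff",
             "The basketball final went to overtime. The league crowned a new champion.", "sports"),
    make_doc("tennis match report", "tennis match tournament",
             "The tennis match lasted five sets. The tournament ends on Sunday.", "sports"),
    make_doc("stock market report", "stock market shares",
             "The stock market closed higher. Shares of banks led the gains.", "finance"),
    make_doc("bank interest rates", "bank interest loan",
             "The bank raised interest rates. Loan demand is expected to fall.", "finance"),
    make_doc("market inflation report", "inflation prices market",
             "Inflation pushed prices up. The market reacted cautiously.", "finance"),
]

if __name__ == "__main__":
    new_doc = make_doc("weekly update", "stock shares bank",
                       "Stock shares fell while the bank raised rates.")
    print(classify_progressively(new_doc, TRAINING))
```

In this sketch, a document whose title already matches several training titles of one class is decided at the first stage, so the longer and more expensive full-text comparison is skipped. Under these assumptions, raising `min_share` makes early stages accept less often, trading speed for caution.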
