Text categorization
Commonly used classification algorithms include:
Decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least squares fitting, kNN, genetic algorithms, maximum entropy, Generalized Instance Set, etc. Here we pick only a few of the most representative algorithms and take a look at them.
Rocchio algorithm
The Rocchio algorithm is probably the first and most intuitive solution people think of when they approach the text categorization problem. The basic idea is to average the sample documents of a category term by term (for example, take the mean number of occurrences of the word "basketball" across all documents in the Sports category, then the mean for "referee", and so on). The result is a new vector, vividly called the centroid, which serves as the most representative vector for the category. When a new document needs to be judged, compare how similar it is to the centroid (more formally, measure the distance between them) to determine whether it belongs to this class. A slightly improved Rocchio algorithm considers not only the documents that belong to the category (called positive samples) but also the documents that do not (called negative samples); the computed centroid is kept as close as possible to the positive samples and as far as possible from the negative samples.
The Rocchio algorithm makes two fatal assumptions that make its performance surprisingly poor. First, it assumes that the documents of a category cluster around a single centroid, which is often not the case in practice (such data is called linearly inseparable). Second, it assumes that the training data is absolutely correct: it has no mechanism for quantitatively measuring whether a sample contains noise, so it has no resistance to erroneous data.
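The centroid idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the toy vocabulary and documents are invented, cosine similarity is one possible choice for the "distance" the text mentions, and the weights `beta` and `gamma` (how strongly positive and negative samples pull on the centroid) are illustrative values.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Turn a document into a vector of raw term counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rocchio_centroid(positives, negatives=(), beta=1.0, gamma=0.0):
    """Centroid of the positive samples, optionally pushed away from the
    mean of the negative samples (beta and gamma are assumed weights)."""
    centroid = [beta * x for x in mean_vector(positives)]
    if negatives and gamma:
        negative_mean = mean_vector(negatives)
        centroid = [c - gamma * n for c, n in zip(centroid, negative_mean)]
    return centroid

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Toy data, invented purely for illustration.
vocabulary = ["basketball", "referee", "stock", "market"]
training = {
    "sports": ["basketball referee basketball", "referee basketball game"],
    "finance": ["stock market stock", "market stock rally"],
}
vectors = {
    label: [to_vector(doc, vocabulary) for doc in docs]
    for label, docs in training.items()
}

# Basic variant: each centroid is the plain average of its own class.
centroids = {label: rocchio_centroid(vs) for label, vs in vectors.items()}

# Modified variant: documents of all other classes act as negative samples.
contrastive = {
    label: rocchio_centroid(
        vs,
        negatives=[v for other, ovs in vectors.items() if other != label for v in ovs],
        gamma=0.25,
    )
    for label, vs in vectors.items()
}

new_doc = to_vector("the referee stopped the basketball game", vocabulary)
print(classify(new_doc, centroids))
print(classify(to_vector("stock market", vocabulary), contrastive))
```

Because each class is summarized by a single vector, a category whose documents form several separate clusters is represented poorly, and one mislabeled training document shifts the average directly. These are the two weaknesses the text points out.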
Still, the Rocchio classifier is intuitive and easy to understand, and the algorithm is simple, so it retains some practical value; it is often used for the comparison of diffe