- About 21,700 words
- About 78 pages
- Published 2020-03-25 in Zhejiang
Lecture Notes for the Seminar on Statistical Machine Learning and Data Mining Techniques and Methods

kNN vs. Naive Bayes
- Bias/variance tradeoff
- Variance ≈ capacity
- kNN has high variance and low bias.
  - Infinite memory
- NB has low variance and high bias.
  - Decision surface has to be linear (hyperplane)

Summary
- Categorization
  - Training data
  - Over-fitting
  - Generalize
- Naïve Bayes
  - Bayesian methods
  - Bernoulli NB classifier
  - Multinomial NB classifier
- K-Nearest Neighbor
  - Bias vs. variance
- Feature selection
  - Chi-square test
  - Mutual information

Readings
[1] IIR Ch13, Ch14.2
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

Classification Evaluation

Evaluation: Classic Reuters Data Set
- Most (over)used data set
- 21578 documents
- 9603 training, 3299 test articles (ModApte split)
- 118 categories
  - An article can be in more than one category
  - Learn 118 binary category distinctions
- Average document: about 90 types, 200 tokens
- Average number of classes assigned
  - 1.24 for docs with at least one category
- Only about 10 out of 118 categories are large
- Common categories (#train, #test)
  - Earn (2877, 1087)
  - Acquisitions (1650, 179)
  - Money-fx (538, 179)
  - Grain (433, 149)
  - Crude (389, 189)
  - Trade (369, 119)
  - Interest (347, 131)
  - Ship (197, 89)
  - Wheat (212, 71)
  - Corn (182, 56)

Reuters Text Categorization data set (Reuters-21578) document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future directio
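
The sample record carries its topic labels and train/test split inside the markup, and the slides note that an article can belong to several categories, each learned as a separate binary distinction. A minimal sketch of how such a record might be read and turned into per-category yes/no labels follows. It assumes the record uses the SGML tag layout of the Reuters-21578 distribution. `SAMPLE`, `parse_record`, and `binary_labels` are hypothetical names introduced only for illustration, and the body text is shortened.

```python
import re

# Shortened, illustrative record. The tag layout is ASSUMED to follow the
# SGML markup of the Reuters-21578 distribution; the body is abbreviated.
SAMPLE = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" '
    'OLDID="12981" NEWID="798">\n'
    '<DATE> 2-MAR-1987 16:51:43.42</DATE>\n'
    '<TOPICS><D>livestock</D><D>hog</D></TOPICS>\n'
    '<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>\n'
    '<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork '
    'Congress kicks off tomorrow, March 3, in Indianapolis.</BODY>\n'
    '</REUTERS>'
)


def parse_record(sgml):
    """Extract split, topics, title, and body from one <REUTERS> record."""
    # Attributes live in the opening tag: everything before the first '>'.
    opening_tag = sgml.split(">", 1)[0]
    attrs = dict(re.findall(r'(\w+)="([^"]*)"', opening_tag))

    def field(tag):
        # Return the stripped text between <tag> and </tag>, or "" if absent.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", sgml, re.S)
        return match.group(1).strip() if match else ""

    # Each topic is wrapped in its own <D>...</D> element inside <TOPICS>.
    topics = re.findall(r"<D>(.*?)</D>", field("TOPICS"))
    return {
        "split": attrs.get("LEWISSPLIT", ""),
        "topics": topics,
        "title": field("TITLE"),
        "body": field("BODY"),
    }


def binary_labels(topics, categories):
    """One yes/no label per category (one-vs-rest view of a multi-label doc)."""
    return {category: category in topics for category in categories}


record = parse_record(SAMPLE)
labels = binary_labels(record["topics"], ["earn", "livestock", "hog"])
print(record["split"], record["topics"], labels)
```

Treating each category as its own binary label is what allows one article to count as a positive example for several classifiers at once. Regular expressions are used here only to keep the sketch short; a full SGML parser would be more robust on the real files.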