Data Mining: Key Points of Basic Classification Methods (数据挖掘基本分类方法重点.ppt)

About 25,500 characters
About 101 pages
Published 2017-03-11 in Hubei

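Confidence Interval for Accuracy

For large test sets ($N > 30$), acc has a normal distribution with mean $p$ and variance $p(1-p)/N$:

$$P\left(Z_{\alpha/2} < \frac{\mathrm{acc} - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$$

[Figure: standard normal curve; the area between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$.]

Confidence interval for $p$:

$$p = \frac{2N \cdot \mathrm{acc} + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N \cdot \mathrm{acc} - 4N \cdot \mathrm{acc}^2}}{2\left(N + Z_{\alpha/2}^2\right)}$$

Confidence Interval for Accuracy

Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
$N = 100$, $\mathrm{acc} = 0.8$
Let $1 - \alpha = 0.95$ (95% confidence)
From the probability table, $Z_{\alpha/2} = 1.96$

1 - α    Z
0.99     2.58
0.98     2.33
0.95     1.96
0.90     1.65

N          50     100    500    1000   5000
p (lower)  0.670  0.711  0.763  0.774  0.789
p (upper)  0.888  0.866  0.833  0.824  0.811

Comparing Performance of 2 Models

Given two models, say $M_1$ and $M_2$, which is better?
$M_1$ is tested on $D_1$ (size $n_1$), found error rate $e_1$
$M_2$ is tested on $D_2$ (size $n_2$), found error rate $e_2$
Assume $D_1$ and $D_2$ are independent
If $n_1$ and $n_2$ are sufficiently large, then

$$e_1 \sim N(\mu_1, \sigma_1^2), \qquad e_2 \sim N(\mu_2, \sigma_2^2)$$

Approximate:

$$\hat{\sigma}_i^2 = \frac{e_i(1 - e_i)}{n_i}$$

Comparing Performance of 2 Models

To test if the performance difference is statistically significant: $d = e_1 - e_2$
$d \sim N(d_t, \sigma_t^2)$, where $d_t$ is the true difference
Since $D_1$ and $D_2$ are independent, their variances add up:

$$\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \approx \hat{\sigma}_1^2 + \hat{\sigma}_2^2 = \frac{e_1(1 - e_1)}{n_1} + \frac{e_2(1 - e_2)}{n_2}$$

At the $(1 - \alpha)$ confidence level,

$$d_t = d \pm Z_{\alpha/2}\,\hat{\sigma}_t$$

An Illustrative Example

Given:
$M_1$: $n_1 = 30$, $e_1 = 0.15$
$M_2$: $n_2 = 5000$, $e_2 = 0.25$
$d = |e_2 - e_1| = 0.1$ (2-sided test)

$$\hat{\sigma}_t^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043$$

At the 95% confidence level, $Z_{\alpha/2} = 1.96$:

$$d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$$

The interval contains 0, so the difference may not be statistically significant.

Comparing Performance of 2 Algorithms

Each learning algorithm may produce $k$ models:
$L_1$ may produce $M_{11}, M_{12}, …, M_{1k}$
$L_2$ may produce $M_{21}, M_{22}, …, M_{2k}$
If the models are generated on the same test sets $D_1, D_2, …, D_k$ (e.g., via cross-validation):
For each set: compute $d_j = e_{1j} - e_{2j}$
$d_j$ has mean $d_t$ and variance $\sigma_t^2$
Estimate (with $\bar{d}$ the mean of the $d_j$):

$$\hat{\sigma}_t^2 = \frac{\sum_{j=1}^{k} (d_j - \bar{d})^2}{k(k-1)}, \qquad d_t = \bar{d} \pm t_{1-\alpha,\,k-1}\,\hat{\sigma}_t$$

Computing Impurity Measure

Missing value: Refund is unknown for one record, so the gain is weighted by 0.9, the fraction of records whose Refund value is known.

Before splitting:
Entropy(Parent) $= -0.3 \log_2 0.3 - 0.7 \log_2 0.7 = 0.8813$

Split on Refund:
Entropy(Refund = Yes) $= 0$
Entropy(Refund = No) $= -\frac{2}{6} \log_2 \frac{2}{6} - \frac{4}{6} \log_2 \frac{4}{6} = 0.9183$
Entropy(Children) $= 0.3 \times 0 + 0.6 \times 0.9183 = 0.551$
Gain $= 0.9 \times (0.8813 - 0.551) = 0.9 \times 0.3303 = 0.2973$

Distribute Instances

[Figure: decision-tree node split on Refund (Yes / No), shown twice.]

Probability that Refund = Yes is 3/9
Probability that Refu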
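The table of lower and upper bounds under "Confidence Interval for Accuracy" can be checked numerically. The sketch below assumes the bounds come from solving the normal approximation for p (commonly called the Wilson score interval, a name the slides do not use); the function name `accuracy_confidence_interval` is hypothetical.

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Return (p_lower, p_upper) for the true accuracy p.

    acc: observed accuracy on the test set
    n:   number of test instances
    z:   Z_{alpha/2}; 1.96 corresponds to 95% confidence
    Assumes acc is approximately normal with mean p and variance p(1-p)/n.
    """
    z2 = z * z
    center = 2 * n * acc + z2
    spread = z * math.sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (center - spread) / denom, (center + spread) / denom

# Same accuracy (0.8) evaluated for the test-set sizes listed in the table.
for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_confidence_interval(0.8, n)
    print(f"N={n:5d}  p_lower={lo:.3f}  p_upper={hi:.3f}")
```

The printed bounds agree with the table to within rounding in the third decimal, and they show the interval tightening as N grows.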
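The conclusion of "An Illustrative Example" (the interval contains 0) can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes each error rate's variance is estimated by e(1 - e)/n and that the two test sets are independent, so the variances add; the function name `error_difference_interval` is hypothetical.

```python
import math

def error_difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference d_t between two error rates.

    Assumes independent test sets and a normal approximation, with the
    variance of each error rate estimated as e * (1 - e) / n.
    """
    d = abs(e2 - e1)
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half_width = z * math.sqrt(var_d)
    return d - half_width, d + half_width

# Values from the example: M1 (n1=30, e1=0.15), M2 (n2=5000, e2=0.25).
lo, hi = error_difference_interval(0.15, 30, 0.25, 5000)
significant = not (lo <= 0.0 <= hi)
print(f"d_t in [{lo:.3f}, {hi:.3f}]; significant at 95%: {significant}")
```

Almost all of the variance comes from the small test set of 30 instances, which is why a 10-point gap in error rate still fails the test.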
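The entropy figures under "Computing Impurity Measure" can be recomputed step by step. The sketch below assumes base-2 logarithms and a data set of 10 records, 9 of which have a known Refund value; the class counts are inferred from the fractions quoted in the slide (0.3/0.7, 2/6 and 4/6, weights 0.3 and 0.6) and are otherwise an assumption.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts; empty classes are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts inferred from the slide's fractions (assumed, see lead-in).
parent = entropy([3, 7])        # all 10 records
refund_yes = entropy([0, 3])    # 3 records with Refund = Yes, a pure node
refund_no = entropy([2, 4])     # 6 records with Refund = No

# Children are weighted by their share of all 10 records (0.3 and 0.6).
children = 0.3 * refund_yes + 0.6 * refund_no

# The gain is scaled by the fraction of records with a known Refund value.
known_fraction = 0.9
gain = known_fraction * (parent - children)

print(f"Entropy(Parent)   = {parent:.4f}")
print(f"Entropy(Children) = {children:.3f}")
print(f"Gain              = {gain:.4f}")
```

The unscaled difference Entropy(Parent) - Entropy(Children) is about 0.3303; multiplying by 0.9 gives a gain of about 0.2973.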