Data Mining: Data Preprocessing
Data preprocessing
Outline
Data in the real world is dirty
* Incomplete: occupation=“ ”
* Noisy: Salary=“-10”
* Inconsistent: M, F; 0, 1
Why is Data Preprocessing Important?
Major Tasks in Data Preprocessing
Data cleaning
Missing values
How to Handle Missing Data?
Noisy data
How to Handle Noisy Data?
Binning method
How can we smooth out the data?
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equidepth) bins:
-Bin 1: 4, 8, 9, 15
-Bin 2: 21, 21, 24, 25
-Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
-Bin 1: 9, 9, 9, 9
-Bin 2: 23, 23, 23, 23
-Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
-Bin 1: 4, 4, 4, 15
-Bin 2: 21, 21, 25, 25
-Bin 3: 26, 26, 26, 34
Clustering
Regression
Combined computer and human inspection
Data integration and transformation
Data integration
Data value conflicts
Redundancy in Data Integration
Data transformation
Normalization
Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and that we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to (73,600 - 12,000) / (98,000 - 12,000) ≈ 0.716.
Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
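The worked numbers so far (the smoothed price bins and the min-max and z-score answers for income) can be reproduced with a short Python sketch. The function names are illustrative and do not come from the slides. The sketch assumes that the data are already sorted, that the number of values divides evenly into the bins, that bin means are rounded to the nearest integer, and that ties in boundary smoothing go to the lower boundary.

```python
from statistics import mean


def equal_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins  # assumes an even split
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to the nearest integer."""
    return [[round(mean(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Ties go to the lower boundary (an assumption; the slides do not say).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def z_score(v, mu, sigma):
    """Express v as a number of standard deviations from the mean."""
    return (v - mu) / sigma


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
print(round(min_max(73600, 12000, 98000), 3))
print(round(z_score(73600, 54000, 16000), 3))
```

Under these assumptions the printed bins match the smoothed bins listed earlier, and the last two lines give the 0.716 and 1.225 quoted in the exercises.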
Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
Attribute construction
Data reduction
Data cube aggregation
Dimensionality reduction
Basic heuristic methods of attribute subset selection
Data compression
Lossy data compression
Numerosity reduction
Histograms
Cluster Analysis
Sampling
Discretization and concept hierarchy generation
Discretization
Concept hierarchies
A concept hierarchy for the attribute price
Concept hierarchy generation for numeric data
concep
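Returning to the decimal-scaling example earlier in this section (values of A from -986 to 917), the choice of j can be sketched in Python as follows. The helper name `decimal_scale` is hypothetical, and the sketch assumes j is the smallest non-negative integer for which the largest absolute value, divided by 10^j, falls below 1.

```python
def decimal_scale(values):
    """Return (j, scaled), where scaled = v / 10**j for each value.

    j is the smallest non-negative integer with max(|v|) / 10**j < 1.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


j, scaled = decimal_scale([-986, 917])
print(j, scaled)
```

For the range in the example this yields j = 3, so every value is divided by 1000.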