Chapter 2 Data Preprocessing, Data Mining: Concepts and Techniques (PPT, English edition)

  • About 28,500 characters
  • About 73 pages
  • Published 2018-01-25 from Zhejiang


* Data Mining: Concepts and Techniques *

Discretization and Concept Hierarchy

  • Discretization
      • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
      • Interval labels can then be used to replace actual data values
      • Supervised vs. unsupervised
      • Split (top-down) vs. merge (bottom-up)
      • Discretization can be performed recursively on an attribute
  • Concept hierarchy formation
      • Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data

  • Typical methods (all can be applied recursively):
      • Binning (covered above): top-down split, unsupervised
      • Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
      • Entropy-based discretization: supervised, top-down split
      • Interval merging by χ² analysis: supervised (it uses class labels), bottom-up merge
      • Segmentation by natural partitioning: top-down split, unsupervised

Entropy-Based Discretization

  • Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information (weighted entropy) after partitioning is
      $E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$
  • Entropy is calculated from the class distribution of the samples in the set. Given m classes, the entropy of S1 is
      $\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$
    where $p_i$ is the probability of class i in S1
  • The boundary that minimizes this entropy function over all possible boundaries is selected as a binary discretization
  • The process is applied recursively to the resulting partitions until some stopping criterion is met
  • Such a boundary may reduce data size and improve classification accuracy

Interval Merge by χ² Analysis

  • Merging-based (bottom-up) vs. splitting-based methods
  • Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
  • ChiMerge [Kerber AAAI 1992
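The entropy-based discretization slide describes choosing the boundary T that minimizes the size-weighted entropy of the two resulting intervals, |S1|/|S| · Entropy(S1) + |S2|/|S| · Entropy(S2). The following is a minimal sketch of one such binary split, not the slides' own implementation. The function names `entropy` and `best_boundary`, the use of midpoints between consecutive distinct values as candidate boundaries, and the sample ages and labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 Shannon entropy of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Sum the per-class terms -p_i * log2(p_i).
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, weighted_entropy) for the binary split with minimum
    size-weighted entropy. Candidate boundaries are assumed to be the
    midpoints between consecutive distinct sorted values."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_t, best_e = (lo + hi) / 2, e
    return best_t, best_e


# Hypothetical sample: ages with a binary class label.
ages = [23, 25, 31, 45, 52, 60]
buys = ["no", "no", "no", "yes", "yes", "yes"]
boundary, weighted = best_boundary(ages, buys)
```

To discretize recursively, as the slide suggests, the same function could be applied again to the samples on each side of the chosen boundary until a stopping criterion (for example, a minimum entropy reduction) is met.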
