- Published 2018-02-26, Jiangsu
[Computer Science] CHAP3_DATA_EXPLORATION
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002

Data Mining: Exploring Data

What is data exploration?
- Key motivations of data exploration include
  - Helping to select the right tool for preprocessing or analysis
  - Making use of humans' abilities to recognize patterns
  - People can recognize patterns not captured by data analysis tools
- Related to the area of Exploratory Data Analysis (EDA)
  - Created by statistician John Tukey
  - Seminal book is Exploratory Data Analysis by Tukey
  - A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook /div898/handbook/index.htm

Techniques Used In Data Exploration
- In EDA, as originally defined by Tukey
  - The focus was on visualization
  - Clustering and anomaly detection were viewed as exploratory techniques
  - In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
- In our discussion of data exploration, we focus on
  - Summary statistics
  - Visualization
  - Online Analytical Processing (OLAP)

Iris Sample Data Set
- Many of the exploratory data techniques are illustrated with the Iris Plant data set.
  - Can be obtained from the UCI Machine Learning Repository /~mlearn/MLRepository.html
  - From the statistician R. A. Fisher
  - Three flower types (classes):
    - Setosa
    - Virginica
    - Versicolour
  - Four (non-class) attributes
    - Sepal width and length
    - Petal width and length

Summary Statistics
- Summary statistics are numbers that summarize properties of the data
  - Summarized properties include frequency, location and spread
  - Examples: location - mean; spread - standard deviation
  - Most summary statistics can be calculated in a single pass through the data

Frequency and Mode
- The frequency of an attribute value is the percentage of time the value occurs in the data set
  - For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 5
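
The summary statistics named in the slides (frequency, mode, mean, standard deviation) can be illustrated with a minimal Python sketch. The function names and the sample values below are assumptions made for illustration; they are not part of the slides, and the values are made up, not taken from the Iris data set. The mean and standard deviation use Welford's running update, which is one common way to compute them in a single pass; the slides do not say which method they have in mind.

```python
from collections import Counter
import math


def frequencies(values):
    """Return {value: fraction of records with that value}.

    The slides define frequency as a percentage; multiply by 100 for that.
    """
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]


def mean_and_std(values):
    """Mean and sample standard deviation in one pass (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # running sum of squared deviations
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return mean, std


# Hypothetical sample values, for illustration only.
species = ["Setosa", "Setosa", "Virginica", "Versicolour", "Setosa", "Virginica"]
lengths = [1.0, 2.0, 3.0, 4.0]

species_freq = frequencies(species)
species_mode = mode(species)
length_mean, length_std = mean_and_std(lengths)
```

Here `frequencies` and `mode` apply to a categorical attribute such as the class label, while `mean_and_std` applies to a continuous attribute such as petal length.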