The Road to Excellence in Big Data.pptx

The Big Data Analyst's Path to Excellence

Typical scenarios for data analysis
(diagram labels: Data, Value, Data, Knowledge, Discovery, Value, Infrastructure)

A new worldview: an uncertain world

The uncertainty of big data
- Nature: uncertainty
- Science: big data hubris

Upgrading the data analysis methodology
Hypotheses, Collection, Preparation, Analytics, Interpretation, Evaluation

Hypotheses
- Mechanically mining correlations and hypotheses
- Intuition: practice on detective novels
- Reading: range widely
- Cross-disciplinary thinking: let ideas collide
- Embed yourself in the business units
- Prevent disconnects between data collection and analysis, and between the business and data analysis

Data! Data! Data!
- n = All!
- Enterprise Data Warehouse → Enterprise Data Hub/Data Lake
- External data sources
- Structured → semi-structured → unstructured
- Log analysis
- Text analysis
- Image/video
- Data with geo and temporal tags
- Networks and graphs

Data? Data? Data?
- n = All?
- More data vs. sampling
- "Raw data" is an oxymoron
- Signal and noise
- Sampling bias
- Data exchange and sharing
- Data rights, data pricing
- Data lifecycle management
- Provenance capture, representation, and querying
- Sometimes data are not assets, but costs

Data quality: the top priority
- Noisy, biased and polluted data are unavoidable
- Goal: models = components for noise + relatively complex models for signal
- Cleansing, validation, …
- Can it start with a small subset? Can the process be automated?
- Work together with visualization, machine learning
- Curation, wrangling, …
- Automated learning to discover structure, resolve entities, and transform data

Data representation
- Reduce compute and communication complexity
- Sparse, compressed data structures
- Approximate computation
- Reduce statistical complexity
- Dimensionality reduction, clustering
- Sampling
- Non-random sampling, compressive sensing, …
- Choose the best representation for specific computational methods
- E.g. tables for data parallelism, networks/graphs for graph parallelism
- UIMA: Unstructured Information Management Architecture

Computational Science

Check your own toolkit
- ML Pipeline
- Scikit-learn style pipelines
- Embrace the world of the cloud

All models are wrong, but some are useful
- Hedgehog (one trick that works everywhere) vs. fox (a different key for every lock)
- Match model complexity to the problem: Occam's razor
- How can more data deliver increasing marginal returns?
- The unreasonable effectiveness of data: 简
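The "More data vs. sampling" and "Sampling bias" points on the "n = All?" slide can be made concrete with a small experiment. The sketch below is not taken from the deck. It assumes a hypothetical population (the integers 0 to 99,999) and a hypothetical collection process that only ever records the upper half of it. The function name `compare_sampling` and its parameters are likewise illustrative assumptions.

```python
import random
import statistics


def compare_sampling(population_size=100_000, random_n=1_000, seed=0):
    """Compare a large biased sample with a small simple random sample."""
    rng = random.Random(seed)

    # Hypothetical population: the integers 0 .. population_size - 1.
    population = list(range(population_size))
    true_mean = statistics.fmean(population)

    # Biased "big data" sample: the collection process (an assumption made
    # for illustration) only records values in the upper half.
    biased_sample = [x for x in population if x >= population_size // 2]

    # Small simple random sample, drawn without replacement.
    random_sample = rng.sample(population, random_n)

    return {
        "true_mean": true_mean,
        "biased_n": len(biased_sample),
        "biased_error": abs(statistics.fmean(biased_sample) - true_mean),
        "random_n": random_n,
        "random_error": abs(statistics.fmean(random_sample) - true_mean),
    }


if __name__ == "__main__":
    for key, value in compare_sampling().items():
        print(f"{key}: {value}")
```

Under these assumptions the biased sample is fifty times larger than the random one, yet its estimate of the mean is off by a quarter of the population range, while the random sample's error is typically far smaller. Adding more records from the same biased process would not shrink that gap, which is one way to read the question mark in "n = All?".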
