Analysis of Cloud Frequent Itemsets Mining Algorithm Based on Sampling

About 54,900 characters
About 68 pages
Published 2018-05-18 in Shanghai

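Research on a Sampling-Based Cloud Frequent Itemset Mining Algorithm

Abstract

With the development of data collection technology, the era of massive data has arrived. Commercial competition today is intense, and people urgently want to extract useful information from massive data to support business decisions. However, traditional data analysis and data mining techniques incur excessive time and space costs when processing massive data, and they struggle to meet this demand. For example, traditional frequent itemset mining must scan the dataset many times, which consumes a great deal of time, and must store a large number of candidate itemsets, which consumes a great deal of memory.

Alongside data collection technology, massive data processing technology has also developed rapidly, offering high concurrency and low cost. In recent years the Hadoop ecosystem has been the most representative example. The Hadoop project consists mainly of two parts, HDFS and MapReduce, which are open-source implementations of the Google File System and Google MapReduce respectively. The Hadoop distributed framework builds a cloud platform from inexpensive commodity machines serving as compute nodes in order to process massive data efficiently. Combining data mining with the Hadoop framework, and using Hadoop's strong massive data processing capability for mining, will bring new vitality to data mining.

This work combines frequent itemset mining with the Hadoop framework and makes the following contributions:

(1) An in-depth study and analysis of the Hadoop platform. Its two core components are HDFS, the distributed file system used to store massive data, and MapReduce, the parallel programming framework used to process it. The two complement each other and together form the Hadoop distributed framework.

(2) To further improve the efficiency of frequent itemset mining, a parallel sampling algorithm based on the Hadoop platform is proposed. The algorithm uses the MapReduce programming framework to perform random sampling in a single scan of the massive data. Data cleaning can also be carried out during sampling.

(3) After an in-depth study of traditional frequent itemset mining algorithms, a sampling-based parallel frequent itemset discovery algorithm is proposed. The algorithm runs on the Hadoop platform and takes full advantage of its ability to process massive data. Experiments show that it has good mining performance.

Keywords: data mining; frequent itemsets; Hadoop; MapReduce

The Research of Cloud Frequent Itemsets Mining Algorithm Which Based on Sample

Abstract

With the development of data collection technology, the era of massive data is coming. Business competition is fierce in today's society, and people are eager to extract useful information from massive data to help them make correct business decisions. However, traditional data analysis and data mining techniques struggle to meet this demand when dealing with massive data because of their excessive time and space costs. For example, traditional frequent itemsets mining needs to scan the dataset so many times that it costs a lot of time. It also needs to store a large number of candidate itemsets, which consumes a large amount of memory. At the same time, cloud computing, with its high concurrency and low cost for massive data processing, is developing rapidly. In recent years, the development of the Hadoop ecosystem has been the most representative. Hadoop is mainly composed of two parts: HDFS and MapReduce. It uses cheap commodity machines as compute nodes to form a cloud platform that can process massive data efficiently. Combining data mining with cloud computing means using the advantage of cloud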
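
The abstract describes single-scan random sampling with data cleaning, followed by frequent itemset discovery on the sample, but it does not give the algorithms themselves. The following is only a minimal single-machine sketch of that general idea in Python, not the author's Hadoop implementation. The function names, the whitespace-separated record format, the Bernoulli sampling rule, and the brute-force counting step are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def sample_mapper(lines, rate, seed=0):
    """Map-style single pass (hypothetical): clean each record, then keep it
    with probability `rate` (Bernoulli sampling, an assumed sampling rule)."""
    rng = random.Random(seed)
    for line in lines:
        # Cleaning step: split on whitespace, drop duplicate items, sort.
        items = sorted(set(line.split()))
        # Empty records are discarded; the rest are sampled independently.
        if items and rng.random() < rate:
            yield tuple(items)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force support counting on the sample (assumed, not the thesis's
    algorithm): keep itemsets whose count is at least min_support * N."""
    transactions = list(transactions)
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(t, k))
    threshold = min_support * len(transactions)
    return {itemset: c for itemset, c in counts.items() if c >= threshold}


if __name__ == "__main__":
    records = ["a b c", "a b", "", "a c a", "b c"]
    sample = list(sample_mapper(records, rate=0.8, seed=42))
    print(frequent_itemsets(sample, min_support=0.5))
```

Because each record is kept or dropped independently, the sampling step needs only one pass over the data and no shared state, which is why it fits a map phase. Itemsets that are frequent in the sample are only candidates for the full dataset, and a sampling-based method would normally need some verification step that this sketch leaves out.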
