SparkSQL Usage Experience Sharing (SparkSQL的使用经验分享.pptx)

A Data Analysis Platform Based on Spark/HBase, and SparkSQL Lessons Learned

BUILD FOR ANALYTICS
We built a distributed graph database for analytical use cases:
- Fast traversal + range query + batch processing
- Need to take care of the interfaces to users (analysts):
  - Query language? (Analysts say: SQL, please!)
  - Interactive data exploration? Visualization? Dynamic charts?
- And be flexible enough to do all sorts of powerful things

BUILD FOR SCALABILITY
Need to be fully scalable and optimized for:
- Data size (100 TB planned, 1 PB next year)
- Ingestion speed (parallel insertion + dynamic splits)
- Parallel and batch queries
- Offline + online processing
- Mining

ARCHITECTURE DEMO

SPARKSQL Lessons Learned: Before You Start
- Build the latest Spark from branch-1.2!
  - Lots of new features and bug fixes
  - Parameters are set to better defaults
  - Join optimizations!

SPARKSQL Lessons Learned: Data Format
- Use Parquet (columnar format, dictionary encoding, compression)
- Data size dropped 40% when converting from our old SequenceFiles (which were supposedly highly compressed)
- Columnar format avoids unnecessary IO and deserialization (30 s vs. 25 min)

SPARKSQL Lessons Learned: Data Format
- Conversion is easy if you already have a Pig/Hive loader for your old data

SPARKSQL Lessons Learned: Load Parquet in HiveContext
- /huangjs/683a4c85ae14e9ae205b#
- Note: column names have to match the Parquet schema. Tip: use CTAS (or a view) for column renaming.
- If it does not work, try setting spark.sql.hive.convertMetastoreParquet to false to use Hive's Parquet implementation

SPARKSQL Lessons Learned: Broadcast Join Optimization
- Biggest headache: join optimization support is still far from mature
- Cannot specify which join algorithm to use:
  - In Hive, we can add /*+ MAPJOIN(…) */
  - Not supported in the SparkSQL HiveContext
- It relies on table metadata retrieved from the Hive metastore
- Use "DESC EXTENDED table" to check the metadata

SPARKSQL Lessons Learned: Broadcast Join Optimization
- If the metadata looks like {EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} … it works!
- If it looks like {numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDat…
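The Parquet and CTAS tips above can be sketched in HiveQL/SparkSQL as follows. This is a minimal sketch, not the presenters' actual code: the table and column names (`events_seqfile`, `events_parquet`, etc.) are hypothetical; only the `spark.sql.hive.convertMetastoreParquet` setting comes from the slides.

```sql
-- Convert an old SequenceFile-backed table to Parquet with CTAS.
-- (Table and column names here are illustrative, not from the deck.)
CREATE TABLE events_parquet
STORED AS PARQUET
AS SELECT user_id, event_time, payload FROM events_seqfile;

-- CTAS or a view can also rename columns so they match the Parquet schema:
CREATE VIEW events_renamed AS
SELECT user_id AS uid, event_time AS ts FROM events_parquet;

-- If loading Parquet through HiveContext misbehaves, fall back to
-- Hive's own Parquet reader, as the slides suggest:
SET spark.sql.hive.convertMetastoreParquet=false;
```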
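The broadcast-join workflow described above can be sketched as the following sequence of statements. This is a hedged sketch under assumptions: `dim_small` is a hypothetical dimension table; `DESC EXTENDED` is the check named in the slides, while the `ANALYZE TABLE` step and the `spark.sql.autoBroadcastJoinThreshold` setting are standard Hive/SparkSQL facilities that the deck does not explicitly mention.

```sql
-- Check whether the metastore has a usable totalSize for the small table
-- (look for totalSize=... in the table parameters, as shown in the slides):
DESC EXTENDED dim_small;

-- If totalSize is 0 or missing, recompute basic statistics in Hive so that
-- the metastore entry is populated (assumption: not stated in the deck):
ANALYZE TABLE dim_small COMPUTE STATISTICS NOSCAN;

-- SparkSQL broadcasts a table automatically when its reported size is below
-- this threshold, in bytes; raise it for larger dimension tables:
SET spark.sql.autoBroadcastJoinThreshold=104857600;  -- 100 MB
```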
