Developer Best Practices Day - Spark-Ecosystem.pdf

About 4,010 characters
About 42 pages
Published 2019-02-01 from Shandong

Spark Ecosystem Internals
Chen Chao (陈超) @CrazyJvm
Developer Best Practices Day @ Beijing 3W Coffee

Show of Hands
How familiar are you with Spark?
A. Heard of it, but haven't used it before.
B. Kicked the tires with some basics.
C. Worked or working on a proof-of-concept deployment.
D. Worked or working on a production deployment.

Outline
•  basis internals
•  ecosystem

Current Major Release
•  released Spark 1.2

Spark: What and Why
•  Apache Spark is a fast and general engine for large-scale data processing.
•  Speed
•  Ease of Use
•  Generality
•  Integrated with Hadoop

Hadoop Data Sharing

Spark Data Sharing
DAG, in-memory

Why Is Spark Fast?
•  Memory-based computation
•  DAG
•  Thread Model
•  Optimization (e.g. delay scheduling)

BDAS
one stack to rule them all

Key Concept-RDD
•  A list of partitions
•  A function for computing each split
•  A list of dependencies on other RDDs
•  Optionally, a Partitioner for key-value RDDs
•  Optionally, a list of preferred locations to compute each split on
Immutable!!!

Key Concept-Lineage
unroll partition safely when caching

Key Concept-Dependency

Key Concept-ClusterManager
•  Local
•  Standalone
•  Yarn
•  Mesos

Cluster Overview

Schedule

Executor

Shuffle
Sort-based shuffle supported

Shuffle
•  Pull-based (not push-based)
•  Write intermediate files to disk
•  Build hash map within each partition
•  Can spill across keys
•  A single key-value pair must fit in memory

Better Metrics System
•  Previously: only collect after task completed
•  Now : report when task
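
The five properties listed under Key Concept-RDD, together with the lineage idea, can be illustrated with a small pure-Python toy. This is only a sketch under stated assumptions, not Spark's implementation: `ToyRDD`, `parallelize`, `lineage` and every field name are hypothetical, everything runs in one process, and no scheduler, shuffle or cluster manager is modelled.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)  # frozen: a ToyRDD cannot be mutated after creation
class ToyRDD:
    """Hypothetical stand-in for an RDD; it mirrors the five listed properties."""
    num_partitions: int                                   # 1. a list of partitions (here: indices)
    compute: Callable[[int], Iterator]                    # 2. function for computing each split
    dependencies: Tuple["ToyRDD", ...] = ()               # 3. dependencies on other RDDs
    partitioner: Optional[Callable[[object], int]] = None          # 4. optional, key-value only
    preferred_locations: Optional[Callable[[int], List[str]]] = None  # 5. optional locality hints
    name: str = "rdd"

    def map(self, f):
        # A transformation returns a NEW ToyRDD that remembers its parent.
        parent = self
        return ToyRDD(parent.num_partitions,
                      lambda i: (f(x) for x in parent.compute(i)),
                      dependencies=(parent,), name="map")

    def filter(self, pred):
        parent = self
        return ToyRDD(parent.num_partitions,
                      lambda i: (x for x in parent.compute(i) if pred(x)),
                      dependencies=(parent,), name="filter")

    def collect(self):
        # An action: only now is each partition actually computed.
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lineage(self):
        # Walk the (single-parent) dependency chain back to the source.
        chain, rdd = [self.name], self
        while rdd.dependencies:
            rdd = rdd.dependencies[0]
            chain.append(rdd.name)
        return chain


def parallelize(data, num_partitions):
    """Split a local collection into contiguous chunks, one per partition."""
    data = list(data)
    size = -(-len(data) // num_partitions)  # ceiling division
    return ToyRDD(num_partitions,
                  lambda i: iter(data[i * size:(i + 1) * size]),
                  name="parallelize")


if __name__ == "__main__":
    result = parallelize(range(6), 2).map(lambda x: x * 10).filter(lambda x: x > 10)
    print(result.lineage())         # ['filter', 'map', 'parallelize']
    print(result.collect())         # [20, 30, 40, 50]
    # A "lost" partition can be rebuilt on its own by replaying the lineage.
    print(list(result.compute(1)))  # [30, 40, 50]
```

Because each toy RDD is immutable and only records how to derive its partitions from its parent, recomputing a single partition needs nothing beyond the dependency chain, which is the intuition behind lineage-based recovery.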
