开发者最佳实践日-Spark-Ecosystem.pdf

Spark Ecosystem Internals
陈超 (Chen Chao) @CrazyJvm
Developer Best Practices Day @ Beijing 3W Coffee

Show of Hands
How familiar are you with Spark?
A. Heard of it, but haven't used it before.
B. Kicked the tires with some basics.
C. Worked or working on a proof-of-concept deployment.
D. Worked or working on a production deployment.

Outline
•  basis internals
•  ecosystem

Current Major Release
•  released Spark 1.2

Spark: What and Why
•  Apache Spark is a fast and general engine for large-scale data processing.
•  Speed
•  Ease of Use
•  Generality
•  Integrated with Hadoop

Hadoop Data Sharing

Spark Data Sharing
DAG in-memory

Why Is Spark Fast?
•  Memory-based computation
•  DAG
•  Thread Model
•  Optimization (e.g. delay scheduling)

BDAS
one stack to rule them all

Key Concept-RDD
•  A list of partitions
•  A function for computing each split
•  A list of dependencies on other RDDs
•  Optionally, a Partitioner for key-value RDDs
•  Optionally, a list of preferred locations to compute each split on
Immutable!!!

Key Concept-Lineage
unroll partition safely when caching

Key Concept-Dependency

Key Concept-ClusterManager
•  Local
•  Standalone
•  Yarn
•  Mesos

Cluster Overview

Schedule

Executor

Shuffle
Sort-based shuffle supported

Shuffle
•  Pull-based (not push-based)
•  Write intermediate files to disk
•  Build hash map within each partition
•  Can spill across keys
•  A single key-value pair must fit in memory

Better Metrics System
•  Previously: only collect after task completed
•  Now: report when task
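
The five properties listed under Key Concept-RDD, together with the "Immutable" and lineage points, can be illustrated with a small toy model. This is only a sketch in plain Python and not Spark's implementation: `ToyRDD`, `parallelize`, and `lineage` are hypothetical names made up for this example, and it runs in a single process with no cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ToyRDD:
    """Hypothetical stand-in for an RDD; not Spark's actual class."""
    num_partitions: int                       # a list of partitions (by index)
    compute: Callable[[int], List]            # function computing one split
    dependencies: Tuple["ToyRDD", ...] = ()   # dependencies on parent RDDs
    partitioner: Optional[Callable[[object], int]] = None            # optional
    preferred_locations: Optional[Callable[[int], List[str]]] = None  # optional
    name: str = "rdd"

    def map(self, f):
        # A transformation never changes this RDD; it returns a new one
        # whose compute function calls back into the parent.
        return ToyRDD(
            num_partitions=self.num_partitions,
            compute=lambda i: [f(x) for x in self.compute(i)],
            dependencies=(self,),
            name="map",
        )

    def collect(self):
        # Compute every partition in order and concatenate the results.
        out = []
        for i in range(self.num_partitions):
            out.extend(self.compute(i))
        return out

    def lineage(self):
        # Walk the first parent of each RDD back to the source.
        chain, rdd = [self.name], self
        while rdd.dependencies:
            rdd = rdd.dependencies[0]
            chain.append(rdd.name)
        return chain


def parallelize(data, num_partitions):
    """Split a local collection into contiguous chunks (toy source RDD)."""
    data = list(data)
    size = -(-len(data) // num_partitions)  # ceiling division

    def compute(i):
        return data[i * size:(i + 1) * size]

    return ToyRDD(num_partitions, compute, name="parallelize")


base = parallelize([1, 2, 3, 4], 2)
mapped = base.map(lambda x: x * 10)
print(mapped.collect())   # [10, 20, 30, 40]
print(mapped.lineage())   # ['map', 'parallelize']
```

Because each `ToyRDD` only records how to compute a split from its parents, a lost partition could be rebuilt by calling `compute` again, which is the idea the lineage slide refers to.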
