Extract, Tiered Transform, Load (ETTL): a pipeline for a modular, scalable and observable internal analytics platform

  • About 3,800 characters
  • About 74 pages
  • Published 2019-11-18 in Guangdong


A pipeline for a modular, scalable and observable Internal Analytics platform

Agenda

WHAT we do and how
  • Monitoring platform for cloud-scale infrastructures and applications
  • Internal Analytics team: Data Engineers + Data Analysts
  • Our data
  • How other teams access our data

WHY ETTL: Challenges & Requirements
  • Evolving data sources = changing the whole pipeline
  • Low resilience to changes in data sources
  • Cleaning and transformations duplicated
  • Parameters and functions all in one giant utils
  • One task fails → whole pipeline fails
  • Backfilling can take forever
  • Task dependency nightmare
  • DataOps very complex

E T T L

Bronze
  • Data sources

Bronze = Data gathering

    customer_hourly_usage
    customer_id        22
    name               DatacatX
    billing_plan_id    3
    creation_tstamp    2018-01-14 10:15:34.171
    hour_data          2018-10-24 15:00:00
    server_count       154
    country            France

  • S3 Bronze bucket

Silver = Normalization layer

    customer_hourly_usage
    customer_id        22
    name               DatacatX
    billing_plan_id    3
    creation_tstamp    2018-01-14 10:15:34.171
    hour_data          2018-10-24 15:00:00
    server_count       154
    country            France

Gold = Analytics layer

    fact_usage_daily
    column              value         type
    customer_id         22            int
    customer_name       datacatx      string
    billing_plan        pro           string
    usage_date          2018-10-24    date
    is_in_trial         false         boolean
    server_count_max    175           int
    server_count_avg    141.5         float

Load to data warehouse

Bronze

The orchestra maestro
  • Holds logic between the tasks and chains them together based on a dependency graph

    import luigi
    from luigi.contrib.s3 import S3Target  # S3Target lives in luigi.contrib.s3

    class MyTask(luigi.Task):
        param = luigi.Parameter(default=42)

        def requires(self):
            # OtherTask is the upstream task; it is not defined on the slide
            return OtherTask(self.param)

        def run(self):
            # Write the task's result to the target declared in output()
            f = self.output().open('w')
            print('hello world', file=f)
            f.close()

        def output(self):
            # Luigi treats the task as complete once this target exists
            return S3Target('s3://bucket/folder/file-%s.csv' % self.param)

    if __nam
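The Silver-to-Gold step in the slides above (hourly customer_hourly_usage rows rolled up into one fact_usage_daily row) can be sketched roughly as follows. This is a minimal illustration, not the team's implementation. The function silver_to_gold, the BILLING_PLANS lookup, and the TRIAL_DAYS rule are hypothetical, since the slides do not say how billing_plan or is_in_trial are derived. Only the 15:00 row comes from the slides; the other three hourly rows are made up so that the aggregates match the slide's values (max 175, average 141.5).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical lookup: the slides only imply that plan id 3 maps to "pro".
BILLING_PLANS = {3: "pro"}
# Hypothetical trial length: the slides do not say how is_in_trial is derived.
TRIAL_DAYS = 14
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def silver_to_gold(silver_rows):
    """Roll hourly usage rows up into one daily fact row per customer and day."""
    groups = defaultdict(list)
    for row in silver_rows:
        usage_date = datetime.strptime(row["hour_data"], TS_FORMAT).date()
        groups[(row["customer_id"], usage_date)].append(row)

    gold_rows = []
    for (customer_id, usage_date), rows in sorted(groups.items()):
        first = rows[0]
        # creation_tstamp carries milliseconds; keep only the first 19 characters.
        created = datetime.strptime(first["creation_tstamp"][:19], TS_FORMAT).date()
        counts = [r["server_count"] for r in rows]
        gold_rows.append({
            "customer_id": customer_id,
            "customer_name": first["name"].lower(),
            "billing_plan": BILLING_PLANS.get(first["billing_plan_id"], "unknown"),
            "usage_date": usage_date,
            "is_in_trial": (usage_date - created).days < TRIAL_DAYS,
            "server_count_max": max(counts),
            "server_count_avg": mean(counts),
        })
    return gold_rows


def hourly(hour, server_count):
    """Build one Silver row shaped like the slide's customer_hourly_usage example."""
    return {
        "customer_id": 22,
        "name": "DatacatX",
        "billing_plan_id": 3,
        "creation_tstamp": "2018-01-14 10:15:34.171",
        "hour_data": f"2018-10-24 {hour:02d}:00:00",
        "server_count": server_count,
        "country": "France",
    }


# Only the 15:00 row (154 servers) is from the slides; the rest are illustrative.
SAMPLE = [hourly(14, 120), hourly(15, 154), hourly(16, 175), hourly(17, 117)]
GOLD = silver_to_gold(SAMPLE)
```

Grouping by (customer_id, usage_date) keeps the Gold row independent of how many hourly rows arrive, so a backfill for a single day only needs that day's Silver rows.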
