Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walkway with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
Session hashtag: #SAISDev15
Burak Yavuz is a Software Engineer and Apache Spark committer at Databricks. He has been developing Structured Streaming and Delta Lake to simplify the lives of Data Engineers. Burak received his MS in Management Science & Engineering at Stanford and his BS in Mechanical Engineering at Bogazici University, Istanbul.