Max Schultze is a lead data engineer building a data lake at Zalando, Europe's biggest online platform for fashion. His focus is on building data pipelines at petabyte scale and productionizing Spark and Presto on Delta Lake inside the company. He graduated from the Humboldt University of Berlin, where he actively took part in the university's initial development of Apache Flink.
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to the democratization of data. However, many who set out to build a centralized Data Lake ended up with a data swamp: unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando - Europe's biggest online fashion retailer - we realized that accessibility and availability at scale can only be guaranteed by moving more responsibilities to those who pick up the data and have the respective domain knowledge - the data owners - while keeping only data governance and metadata information central. Such a decentralized, domain-focused approach has recently been coined a Data Mesh. The Data Mesh paradigm promotes the concept of Data Products, which go beyond the sharing of files towards guarantees of quality and acknowledgement of data ownership. This talk will take you on a journey of how we went from a centralized Data Lake to embracing a distributed Data Mesh architecture backed by Spark and built on Delta Lake, and will outline the ongoing efforts to make creating a data product as simple as applying a template.
Zalando strives to be a fully data-driven company that uses AI to make decisions quickly and accurately. For this reason we have built a Data Lake that contains all of the company's data. To provide easy access to that data and enable the company to make use of it, we have established an internal platform that offers Databricks as a service to all departments and teams. Making Databricks Delta tables available to all clients of the Data Lake enabled them to leverage Structured Streaming and build continuous applications on top of it. A big part of this journey was solving challenges in governance, security and access management.
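As a minimal sketch of the pattern described above - a Delta table produced by one team consumed as a streaming source by another, which is what makes continuous applications on the Data Lake possible - the following Spark code is illustrative; the table locations and column names are assumptions, not Zalando's actual setup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object DeltaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-continuous-app")
      // Delta Lake is enabled via its Spark extension (bundled in Databricks runtimes).
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Read the Delta table as an unbounded stream: new commits arrive as micro-batches.
    val orders = spark.readStream
      .format("delta")
      .load("s3://data-lake/orders")   // hypothetical table location

    // A continuous application: aggregate and publish the result as another Delta table.
    val query = orders
      .groupBy("country")              // hypothetical column
      .count()
      .writeStream
      .format("delta")
      .outputMode("complete")
      .option("checkpointLocation", "s3://data-lake/_checkpoints/orders_by_country")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start("s3://data-lake/orders_by_country")

    query.awaitTermination()
  }
}
```

The checkpoint location is what lets the downstream application resume exactly where it left off, so the derived table stays continuously up to date without any manual orchestration.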
In this talk we will share our experience of productionizing and operating Databricks at scale and of making data-driven continuous applications feasible out of the box.