At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged.
Delta Lake focuses on improving the reliability and scalability of data lakes. Its higher-level abstractions and guarantees, including ACID transactions and time travel, drastically simplify real-world data engineering architectures. Since we open sourced Delta Lake six months ago, we have been humbled by the reception. The project has been deployed at thousands of organizations and processes exabytes of data each month, becoming an indispensable pillar in data and AI architectures.
To further drive adoption and grow the community, we’ve decided to partner with the Linux Foundation to leverage their platform and their extensive experience in fostering influential open source projects, ranging from Linux itself to Jenkins and Kubernetes. We are joined in the announcement by Alibaba, Booz Allen Hamilton, Intel, and Starburst, who will help develop Delta Lake support not just for Apache Spark, but also for Apache Hive, Apache NiFi, and Presto.
Rich Feature Sets for More Robust Data Lakes
As discussed earlier, Delta Lake makes data lakes easier to work with and more robust. It is designed to address many of the problems commonly found in data lakes. For example, incomplete data ingestion can lead to corrupt data; Delta Lake addresses this with ACID Transactions, which hold even when multiple data pipelines read and write data to a data lake concurrently. Data sources feeding data lakes may not provide complete column data or correct data types, so Schema Enforcement prevents bad data from causing data corruption. Change data capture and update/delete/upsert support allow non-append-only workloads to work well on data lakes, a must for GDPR/CCPA compliance.
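To make the idea of schema enforcement concrete, here is a minimal sketch in plain Python. It is a toy model, not Delta Lake's API: the names `EXPECTED_SCHEMA`, `SchemaMismatch`, `validate_batch`, and `append_batch` are hypothetical, and the "table" is just an in-memory list. It only illustrates the principle that a whole batch is rejected before anything is written if any row has the wrong columns or types.

```python
# Toy illustration of schema enforcement; all names here are hypothetical
# and this is not how Delta Lake itself is implemented.

# Assumed table schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"user_id": int, "country": str, "amount": float}


class SchemaMismatch(Exception):
    """Raised when a batch does not match the table's schema."""


def validate_batch(rows, schema):
    """Check every row's columns and types; raise on the first mismatch."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaMismatch(
                f"row {i}: columns {sorted(row)} do not match {sorted(schema)}"
            )
        for column, expected_type in schema.items():
            if not isinstance(row[column], expected_type):
                raise SchemaMismatch(
                    f"row {i}: column {column!r} expected "
                    f"{expected_type.__name__}, got {type(row[column]).__name__}"
                )


def append_batch(table, rows, schema):
    """Append all rows or none: validation runs before any write."""
    validate_batch(rows, schema)
    table.extend(rows)


if __name__ == "__main__":
    table = []
    append_batch(table, [{"user_id": 1, "country": "NL", "amount": 9.5}], EXPECTED_SCHEMA)
    try:
        # The second row carries user_id as a string, so the whole batch is rejected.
        append_batch(
            table,
            [
                {"user_id": 2, "country": "DE", "amount": 1.0},
                {"user_id": "3", "country": "FR", "amount": 2.0},
            ],
            EXPECTED_SCHEMA,
        )
    except SchemaMismatch as err:
        print("rejected:", err)
    print(len(table), "row(s) in table")
```

Validating before writing is what keeps a partially bad batch from leaving the table in a half-updated state; a real system would pair this with an atomic commit.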
The list of Delta Lake’s capabilities goes on, with the overarching goal of bringing greater data reliability and scalability to data lakes, so that their data can be consumed more easily by other systems and technologies.
Data Lake Openness and Extensibility
The key tenets of Delta Lake’s design are openness and extensibility. Delta Lake stores all the data and metadata in cloud object stores, with an open protocol design that leverages existing open formats such as JSON and Apache Parquet. This openness not only removes the risk of vendor lock-in, but is also critical in building an ecosystem that enables a myriad of use cases across data science, machine learning, and SQL.
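As a rough illustration of how metadata kept in an open format like JSON can support transactions and time travel, the sketch below keeps an ordered log of JSON commit files and replays it to find the table's live data files. It is a simplified model under stated assumptions, not the Delta Lake protocol: the function names `commit` and `snapshot`, the file naming, and the shape of the `add`/`remove` actions are illustrative, and it uses a local directory where exclusive file creation stands in for whatever atomic primitive a real object store would need.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Write one commit file; fail if that version already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" creates the file exclusively, so two writers on the same
    # local filesystem cannot both claim the same version number.
    with open(path, "x", encoding="utf-8") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(log_dir, as_of=None):
    """Replay commits in version order and return the set of live data files.

    Passing as_of replays only up to that version (a toy form of time travel).
    """
    live = set()
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name), encoding="utf-8") as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
    # Version 1 replaces the first data file with a rewritten one.
    commit(log_dir, 1, [
        {"remove": {"path": "part-0000.parquet"}},
        {"add": {"path": "part-0001.parquet"}},
    ])
    print("latest:", snapshot(log_dir))
    print("as of version 0:", snapshot(log_dir, as_of=0))
```

Because the log is plain JSON and the data files are Parquet, any engine that can read both formats could in principle reconstruct the same table state, which is the point of an open protocol.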
To ensure the project’s long-term growth and community development, we’ve worked with the Linux Foundation to further this spirit of openness.
Open Delta Lake Governance & Community Participation
We’re excited that the Linux Foundation will now host Delta Lake as a neutral home for the project, with an open-governance model to encourage participation and technical contributions. This will help provide a framework for long-term stewardship; establish a community ecosystem invested in Delta Lake’s success; and develop an open standard for data storage in data lakes. We believe that this approach will help ensure that data stored in Delta Lake remains open and accessible, while driving increased innovation and development to solve the challenging problems in this space.
The Databricks team has created and contributed to a variety of open-source projects for the data & AI ecosystem, including Apache Spark, MLflow, Koalas, and Delta Lake. We continue to participate in the open-source community because we know it’s the fastest, most comprehensive way to bring new capabilities to market. We’ve been able to build a sustainable, healthy business, while also connecting with the community to ensure that projects don’t lock customers into proprietary systems or data formats.