New features help data teams streamline reliable data pipelines and easily discover and govern enterprise data assets across multiple clouds and data platforms
SAN FRANCISCO — May 26, 2021 — Today, at the Data + AI Summit, Databricks announced two innovations that enhance its lakehouse platform by improving reliability, governance and scale. First, the company revealed Delta Live Tables, which simplifies the development and management of reliable data pipelines on Delta Lake. The company also announced Unity Catalog, a new, unified data catalog that makes it easy to discover and govern all of an organization’s data assets, with a complete view of data across clouds and existing catalogs. Unity Catalog is underpinned by Delta Sharing, a new open source protocol for secure data sharing also announced by Databricks today. As a result, organizations can also use Unity Catalog to manage secure data sharing with business partners and data exchanges, further emphasizing the flexibility provided by an open lakehouse platform.
Delta Live Tables: Building the foundation of the lakehouse with reliable data pipelines
Delta Live Tables is a cloud service in the Databricks platform that makes ETL (extract, transform and load) easy and reliable on Delta Lake, helping ensure data is clean and consistent when used for analytics and machine learning.
Today, building reliable ETL pipelines at scale is difficult for enterprises. Poor reliability leads to missing or incorrect data in business-critical systems, often resulting in costly errors for the organization. Building pipelines is a highly manual process that requires granular work to define both how data should be manipulated and how the accuracy of those manipulations should be tested. In addition, as the number of pipelines grows in response to more data being gathered and used, managing and updating them becomes a heavy operational burden.
Delta Live Tables addresses this challenge by abstracting away the low-level instructions, removing many potential sources of error. Instead of explaining how every step of a pipeline should work, data engineers specify only the outcomes the pipeline needs to achieve, using high-level languages like SQL. Delta Live Tables then automatically creates the instructions for both the data transformations and the data validations, and implements uniform error handling. Managing pipelines at scale is improved through chained dependencies that automatically propagate changes downstream when a table is modified. Additionally, Delta Live Tables can restart pipelines to resolve transient errors. If a failure requires manual intervention, or if new business logic requires changes to the data, Delta Live Tables makes it easy for data engineering teams to pinpoint the source of the error, remediate it quickly and then reprocess data from that point.
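To make the declarative approach described above more concrete, the following is a minimal Python sketch of the general idea: tables are declared together with their inputs and validation rules, and a small runner derives the execution order, applies the rules and retries transient failures. It is an illustration only and is not the Delta Live Tables API; the names Pipeline, table, expect, TransientError and the sample tables are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class TransientError(Exception):
    """Hypothetical stand-in for a failure that is worth retrying."""


class Pipeline:
    """Toy declarative pipeline; NOT the Delta Live Tables API."""

    def __init__(self):
        # name -> (build function, dependency names, validation rules)
        self.tables = {}

    def table(self, name, depends_on=(), expect=None):
        """Declare a table, its input tables, and the rules its rows must satisfy."""
        def register(fn):
            self.tables[name] = (fn, tuple(depends_on), dict(expect or {}))
            return fn
        return register

    def run(self, max_retries=2):
        # The execution order is derived from the declared dependencies,
        # so the author never spells out the steps by hand.
        graph = {name: set(deps) for name, (_, deps, _) in self.tables.items()}
        results, dropped = {}, {}
        for name in TopologicalSorter(graph).static_order():
            fn, deps, rules = self.tables[name]
            inputs = [results[d] for d in deps]
            for attempt in range(max_retries + 1):
                try:
                    rows = fn(*inputs)
                    break
                except TransientError:
                    if attempt == max_retries:
                        raise  # give up: manual intervention is needed
            # Uniform validation: rows that fail any rule are dropped and counted.
            kept = [r for r in rows if all(rule(r) for rule in rules.values())]
            dropped[name] = len(rows) - len(kept)
            results[name] = kept
        return results, dropped


pipeline = Pipeline()
attempts = {"raw_readings": 0}


@pipeline.table("raw_readings")
def raw_readings():
    attempts["raw_readings"] += 1
    if attempts["raw_readings"] == 1:
        raise TransientError("simulated transient failure on the first attempt")
    return [
        {"sensor": "a", "temp_c": 21.5},
        {"sensor": "b", "temp_c": None},
        {"sensor": "a", "temp_c": 22.5},
    ]


@pipeline.table(
    "clean_readings",
    depends_on=["raw_readings"],
    expect={"temp_present": lambda r: r["temp_c"] is not None},
)
def clean_readings(raw):
    return list(raw)


@pipeline.table("avg_temp", depends_on=["clean_readings"])
def avg_temp(clean):
    temps = [r["temp_c"] for r in clean]
    return [{"avg_temp_c": sum(temps) / len(temps)}]


results, dropped = pipeline.run()
print(results["avg_temp"], dropped)
```

In this sketch, the author states only what each table contains and what valid rows look like; the ordering, retry and validation logic live in one place, which is the kind of separation the paragraph above describes.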
“At Shell, we are aggregating all of our sensor data into an integrated data store – working at the multi-trillion record scale. Delta Live Tables has helped our teams save time and effort in managing data at this scale. We have been focusing on continuously improving our AI engineering capability and have an Integrated Development Environment (IDE) with a graphical interface supporting our Extract Transform Load (ETL) work. With this capability augmenting the existing lakehouse architecture, Databricks are disrupting the ETL and data warehouse market which is important for companies like ours. We are excited to continue to work with Databricks as an innovation partner.”
Unity Catalog: Simplified governance of data and AI across multiple cloud platforms
Today, the vast majority of data within enterprises is flowing into cloud-based data lakes. But data lakes present significant governance challenges. First, cloud providers don’t offer fine-grained access controls. Privileges stop at the file level rather than extending to the contents of the file, making access an all-or-nothing proposition. The only way around this is to copy subsets of a file’s data into new files, and this proliferation of files is one of the major reasons why data lakes become data swamps. With multi-cloud adoption on the rise, the problem gets even harder, because each cloud provider has a different set of APIs for managing access. Second, the world has moved beyond simply trying to govern well-structured data. Modern data assets take many forms, including dashboards, machine learning models, and unstructured data like video and images that legacy data governance solutions simply weren’t built to manage.
Unity Catalog addresses these challenges by providing one interface for fine-grained governance of all data assets, both structured and unstructured, across all cloud data lakes, making it easier for enterprises to unify their data on the Databricks Lakehouse Platform. Unity Catalog is based on industry-standard ANSI SQL to streamline implementation and standardize governance across clouds. Unity Catalog also integrates with existing data catalogs so organizations can build on what they already have and establish a future-proof, centralized governance model without costly migrations. Already, strategic Databricks partners like Alation, Collibra, Immuta and Privacera have committed to contribute to an ecosystem of powerful integrations for Unity Catalog.
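The difference between file-level and fine-grained access can be illustrated with a short Python sketch. It models a catalog in which a grant names the columns a principal may read and, optionally, a row condition, so no filtered copy of the data has to be created. This is a conceptual illustration under stated assumptions, not Unity Catalog's interface; the Catalog class, its methods and the sample data are hypothetical. In SQL-based systems, such privileges are typically expressed with GRANT statements.

```python
class Catalog:
    """Toy catalog with column- and row-level grants; NOT the Unity Catalog API."""

    def __init__(self):
        self._tables = {}  # table name -> list of row dicts
        self._grants = {}  # (principal, table) -> (allowed columns, row predicate)

    def register(self, table, rows):
        self._tables[table] = list(rows)

    def grant_select(self, principal, table, columns, where=None):
        """Allow a principal to read only the given columns of matching rows."""
        predicate = where if where is not None else (lambda row: True)
        self._grants[(principal, table)] = (tuple(columns), predicate)

    def select(self, principal, table):
        key = (principal, table)
        if key not in self._grants:
            # No grant at all: the file-level "nothing" case.
            raise PermissionError(f"{principal} has no access to {table}")
        columns, predicate = self._grants[key]
        # The filter is applied at read time, so the data is never copied.
        return [
            {c: row[c] for c in columns}
            for row in self._tables[table]
            if predicate(row)
        ]


catalog = Catalog()
catalog.register(
    "customers",
    [
        {"id": 1, "region": "EU", "email": "a@example.com"},
        {"id": 2, "region": "US", "email": "b@example.com"},
    ],
)
catalog.grant_select(
    "eu_analyst",
    "customers",
    columns=("id", "region"),
    where=lambda row: row["region"] == "EU",
)

visible = catalog.select("eu_analyst", "customers")
print(visible)
```

Here the hypothetical eu_analyst sees only the id and region columns of EU rows, while a principal without a grant is refused outright. The single underlying table stays intact, which avoids the file proliferation described earlier.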
For more information about Unity Catalog, please visit: https://databricks.com/product/unity-catalog. To learn more about Delta Live Tables, which is now available for customers to preview, visit: https://databricks.com/p/product-delta-live-tables
Databricks is the data and AI company. More than 5,000 organizations worldwide — including Comcast, Condé Nast, H&M, and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Delta Lake, Apache Spark™, and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on Twitter, LinkedIn and Facebook.