Build Reliable Data Lakes with Open Delta Lake

Create a central source of truth for data science, machine learning, and analytics

This overview walks you through collecting data into a data lake to serve different data use cases. Learn how to ingest data into your data lake, manage its ETL and security, and give downstream data teams access for ML and BI.

The Challenge

BEFORE

  • Data silos, as traditional data warehouses cannot handle unstructured data and additional systems are needed
  • Complexity and cost of transferring data between multiple disparate data systems
  • Proprietary data formats prevent direct data access from other tools and increase lock-in risk
  • Non-SQL use cases require new copies of data for data science and machine learning
  • Performance bottlenecks with data throughput slowing down data team agility and productivity
  • Increased cost and governance challenges from managing multiple copies of data and security models

The Solution

AFTER

  • Modern data lakes handling all structured and unstructured data in a central repository
  • Cost-effective pipelines to progressively refine reliable data through data lake tables
  • Open data formats ensure data is accessible across all tools and teams, reducing lock-in risk
  • SQL and ML together on your data lake with a single copy of data
  • Fast data for downstream streaming analytics, data science exploration, and model training
  • Build once, access many times across use cases, for consolidated administration and self-service

Build and scale reliable data lakes

Start with an open data strategy and open technologies

Your data is a strategic asset. Whether for decision making, key business processes, or direct revenue generation, data should be managed carefully. The last thing you want is to have it locked inside a proprietary system or a closed data format that leaves you vulnerable to vendor pricing, contracts, or technology decisions. Open data lakes ensure your data is always accessible, unlike traditional data warehouses. Delta Lake is an open source storage layer that adds data reliability and performance to your data lake; it is built on open data formats and open APIs and is hosted by the Linux Foundation. Data can also be accessed directly with different tools and technologies.

Learn more

Collect all the data in your company together

Data shouldn’t be siloed in applications, databases, or file storage. Start with a broad set of data ingestion capabilities to easily populate your data lake, including partner data integrations, Auto Loader for cloud blob storage, an idempotent copy command, and data source APIs. Leverage the right approach for your architecture to land raw operational data from your systems into your central data lake on cost-effective cloud storage, without compromising data reliability or security.
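
As an illustration, a minimal PySpark sketch of landing raw files into a bronze Delta table with Auto Loader might look like the following; the paths, file format, and table name are placeholders, not values from this overview.

    # Incrementally ingest raw JSON files from a cloud storage landing zone
    # into a bronze Delta table using Auto Loader (the "cloudFiles" source).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ingest-raw-events").getOrCreate()

    raw_events = (
        spark.readStream.format("cloudFiles")                      # Auto Loader source
        .option("cloudFiles.format", "json")                       # raw files are JSON in this sketch
        .option("cloudFiles.schemaLocation", "/lake/_schemas/events")
        .load("/landing/events/")                                  # cloud blob storage landing zone
    )

    (
        raw_events.writeStream.format("delta")                     # write into the lake as a Delta table
        .option("checkpointLocation", "/lake/_checkpoints/bronze_events")
        .trigger(availableNow=True)                                # process all available files, then stop
        .toTable("bronze_events")
    )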

Learn more

Ensure data reliability for production data lakes

Your data lake needs to be reliable in order for it to be trusted by downstream data scientists and data analysts. Delta Lake is an open source storage layer for your existing data lake, and uses versioned Apache Parquet™ files and a transaction log to keep track of all data commits, which enables many reliability capabilities. Maintain data integrity, even with multiple data pipelines concurrently reading and writing data to your data lake, with ACID transactions. Ensure data types are correct and required columns are present with Schema Enforcement, and update these requirements over time with Schema Evolution.
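
As a small sketch of how Schema Enforcement and Schema Evolution behave, assuming a hypothetical table path and columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Create a Delta table with a known schema.
    events = spark.createDataFrame([(1, "click"), (2, "view")], ["user_id", "action"])
    events.write.format("delta").save("/lake/silver/events")

    # A write with an unexpected column is rejected by Schema Enforcement...
    enriched = spark.createDataFrame([(3, "click", "US")], ["user_id", "action", "country"])
    # enriched.write.format("delta").mode("append").save("/lake/silver/events")  # raises AnalysisException

    # ...unless the schema is explicitly allowed to evolve.
    (
        enriched.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")   # Schema Evolution: the new "country" column is added
        .save("/lake/silver/events")
    )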

Learn more

Data lifecycle management for your data lake

As your data lake grows, it becomes increasingly important to manage its data lifecycle. Update, Merge, and Delete data from your data lake with DML commands, such as for GDPR compliance when user records need to be removed from all tables, or as part of a Change Data Capture process. Revert to previous data versions with Time Travel for auditing, rollbacks, or reproducibility, such as for supporting the needs of downstream data teams or for ETL troubleshooting. Maintain data lake performance and hygiene with the OPTIMIZE and VACUUM commands, and manage your data across Azure and AWS for multi-cloud strategies.
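
A brief sketch of these lifecycle operations with the Delta Lake Python API; the table path, predicate, version number, and retention window below are placeholders:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    users = DeltaTable.forPath(spark, "/lake/silver/users")

    # GDPR-style delete: remove a single user's records from the table.
    users.delete("user_id = 'u-123'")

    # Time Travel: read the table as it looked at an earlier version for auditing or rollback.
    previous = spark.read.format("delta").option("versionAsOf", 5).load("/lake/silver/users")

    # Maintenance: compact small files and remove files no longer referenced.
    spark.sql("OPTIMIZE delta.`/lake/silver/users`")
    users.vacuum(retentionHours=168)   # keep one week of history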

Learn more

Enterprise ready administration, controls, and security

Democratizing data access requires granular security controls and automation. Without them, platform teams are forced into a losing decision: either make data openly accessible to everyone (and risk data security) or lock all data down (and stifle business productivity). Architect a unified cloud security posture with a broad portfolio of administrative capabilities, including IAM roles, Access Controls, Encryption, and Audit Logging. Databricks keeps data in your own cloud infrastructure account, not in a vendor-owned account, ensuring you retain control of your data. Leverage APIs and monitoring to scale administrative workflows and operations. Databricks scales to multiple petabytes of data for sensitive, business-critical use cases, with certifications including HIPAA, SOC 2, PCI, and more.
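
As one illustration, table-level permissions can be granted to groups with SQL statements run from a notebook (where `spark` is the active session); the group and table names are hypothetical, and the exact syntax depends on the governance layer you have enabled:

    # Grant read access on a gold table to analysts, and broader access on the
    # silver schema to data engineers; principals here are placeholder group names.
    spark.sql("GRANT SELECT ON TABLE gold_daily_events TO `data-analysts`")
    spark.sql("GRANT SELECT, MODIFY ON SCHEMA silver TO `data-engineers`")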

Learn more

Progressively and continuously refine data

You can now leverage a “medallion” model to progressively refine your raw data into business-level aggregates as it streams between different data quality tables. Raw data initially lands into a bronze table, which is then filtered, cleaned, and augmented into a silver table. A final gold table holds business-level aggregates that can be readily accessed by business analysts and data scientists. This multi-hop data processing approach brings many benefits, including data quality checkpoints for fault recovery, simple gold table reprocessing for new business logic or data, and a continuous data flow for complete and recent data.
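
A condensed sketch of this multi-hop flow with Structured Streaming, assuming placeholder table names, paths, and cleaning logic:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze -> Silver: filter and clean raw records as they stream in.
    bronze = spark.readStream.format("delta").table("bronze_events")
    silver = (
        bronze.where(F.col("user_id").isNotNull())
        .withColumn("event_date", F.to_date("event_time"))
    )
    (
        silver.writeStream.format("delta")
        .option("checkpointLocation", "/lake/_checkpoints/silver_events")
        .toTable("silver_events")
    )

    # Silver -> Gold: business-level aggregates for analysts and data scientists.
    gold = (
        spark.readStream.format("delta").table("silver_events")
        .groupBy("event_date")
        .agg(F.count("*").alias("daily_events"))
    )
    (
        gold.writeStream.format("delta")
        .outputMode("complete")                 # rewrite the aggregate on each update
        .option("checkpointLocation", "/lake/_checkpoints/gold_daily_events")
        .toTable("gold_daily_events")
    )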

Learn more

Empower analysts with more complete and recent data

SQL reporting, dashboarding, and BI can now be powered by the more complete and recent data offered by data lakes, given their faster data loading, increased data type flexibility, and cost-effective cloud blob storage. The “medallion” data refinement model gives data analysts the flexibility to not only consume business data aggregates, but also drill into the raw, unprocessed data for deeper analysis, such as when investigating anomalies. Previously, data analysts would have to switch to other systems and technologies to access these details. BI and visualization tools like Tableau, Power BI, Looker, or any other ODBC/JDBC-compatible tool can be easily connected, and SQL analysis can easily expand into data science with the same data, platform, and governance.
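
For programmatic access to the same gold tables that BI tools reach over ODBC/JDBC, a small query using the Databricks SQL Connector for Python might look like the following; the hostname, HTTP path, token, and table name are placeholders:

    from databricks import sql   # pip install databricks-sql-connector

    with sql.connect(
        server_hostname="dbc-example.cloud.databricks.com",
        http_path="/sql/1.0/warehouses/abc123",
        access_token="<personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT event_date, daily_events FROM gold_daily_events "
                "ORDER BY event_date DESC LIMIT 7"
            )
            for row in cursor.fetchall():
                print(row)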

Learn more

Self-service for data scientists and machine learning engineers

With complete, reliable, and secure data available in your data lake, your data teams are now ready to run exploratory data science experiments and build production-ready machine learning models. Integrated cloud-based collaborative notebooks with Python, Scala, and SQL make it easy for teams to share analysis and results. Databricks Connect lets teams attach their preferred IDE or notebook, such as IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio, Zeppelin, Jupyter, or other custom applications. And manage the end-to-end machine learning lifecycle with MLflow for experiment tracking, model registry, production deployment, and more.
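
A minimal sketch of experiment tracking with MLflow; the dataset, model, and metric are illustrative only:

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)

    with mlflow.start_run(run_name="iris-baseline"):
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_param("max_iter", 200)
        mlflow.log_metric("train_accuracy", accuracy_score(y, model.predict(X)))
        mlflow.sklearn.log_model(model, "model")   # candidate for the Model Registry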

Learn more

Connect data applications to drive business operations

Improve key business processes with the most complete business insights into your customers, products, markets, and more. Data applications can leverage your data lake to power a wide variety of industry use cases. Whether it’s personalizing customer experiences in media, optimizing prices in retail, fighting fraud in financial services, or drug discovery in life sciences, complete and reliable data in your data lake can power dozens of different streaming applications throughout your business. Delta Lake is open format and open source, meaning that your data lake can be openly accessed by all of your applications and tools for all your business needs.

Learn more

Migrate slow legacy systems to modern cloud data lake

You may already have a legacy data warehouse or an on-premises Hadoop data lake that can no longer meet the growing demands of your data teams, where complex operations, data reliability problems, or performance bottlenecks cause data initiatives to fail. Migrate to a scalable, managed cloud data platform to increase productivity, cut costs, and create more value from your data. Databricks has worked with many customers as part of their cloud journey to move workloads, transfer data, and manage change.

Learn more

Customer Stories

Comcast’s journey to building an agile data and AI platform at scale with Databricks

Learn about Comcast’s data and machine learning infrastructure built on the Databricks Unified Data Analytics Platform. Comcast processes petabytes of content and telemetry data, with millions of transactions a second. Their data lake with Delta Lake is used to train their ML models and improve the customer experience of their Emmy Award-winning service.

Ready to Get Started?