Introduction to Data Lakes

Data lakes provide a complete and authoritative data
store that can power data analytics, business intelligence,
and machine learning

Building a lakehouse with Delta Lake

To build a successful lakehouse, organizations have turned to Delta Lake, an open format data management and governance layer that combines the best of both data lakes and data warehouses. Across industries, enterprises are leveraging Delta Lake to power collaboration by providing a reliable, single source of truth. By delivering quality, reliability, security and performance on your data lake — for both streaming and batch operations — Delta Lake eliminates data silos and makes analytics accessible across the enterprise. With Delta Lake, customers can build a cost-efficient, highly scalable lakehouse that eliminates data silos and provides self-serving analytics to end-users.

Data lakes vs. Data lakehouses vs. Data warehouses

  1. Types of data
    Intended users
  2. Data lake
    All types: Structured data, semi-structured data, unstructured (raw) data
    Open format
    Scales to hold any amount of data at low cost, regardless of type
    Limited: Data scientists
    Low quality, data swamp
    Difficult: Exploring large amounts of raw data can be difficult without tools to organize and catalog the data
  3. Data lakehouse
    All types: Structured data, semi-structured data, unstructured (raw) data
    Open format
    Scales to hold any amount of data at low cost, regardless of type
    Unified: Data analysts, data scientists, machine learning engineers
    High quality, reliable data
    Simple: Provides simplicity and structure of a data warehouse with the broader use cases of a data lake
  4. Data warehouse
    Structured data only
    Closed, proprietary format
    Scaling up becomes exponentially more expensive due to vendor costs
    Limited: Data analysts
    High quality, reliable data
    Simple: Structure of a data warehouse enables users to quickly and easily access data for reporting and analytics

Lakehouse best practices

Use the data lake as a landing zone for all of your data

Save all of your data into your data lake without transforming or aggregating it to preserve it for machine learning and data lineage purposes.

Mask data containing private information before it enters your data lake

Personally identifiable information (PII) must be pseudonymized in order to comply with GDPR and to ensure that it can be saved indefinitely

Secure your data lake with role- and view-based access controls

Adding view-based ACLs (access control levels) enables more precise tuning and control over the security of your data lake than role-based controls alone.

Build reliability and performance into your data lake by using Delta Lake

The nature of big data has made it difficult to offer the same level of reliability and performance available with databases until now. Delta Lake brings these important features to data lakes.

Catalog the data in your data lake

Use data catalog and metadata management tools at the point of ingestion to enable self-service data science and analytics

Shell has been undergoing a digital transformation as part of our ambition to deliver more and cleaner energy solutions. As part of this, we have been investing heavily in our data lake architecture. Our ambition has been to enable our data teams to rapidly query our massive data sets in the simplest possible way. The ability to execute rapid queries on petabyte scale data sets using standard BI tools is a game changer for us.

—Dan Jeavons, GM Data Science, Shell

Read the full story →