Data Lakes Introduction – Databricks

Introduction to Data Lakes

Data lakes provide a complete and authoritative data store that can power data analytics, business intelligence, and machine learning

What is a data lake?

A data lake is a centralized data repository that can store both traditional structured (row and column) data and unstructured, non-tabular raw data in its native format (such as videos, images, and binary files). Data lakes use inexpensive object storage and open formats so that many applications can take advantage of the data.

Data lakes are often used to consolidate all of an organization’s data in a single, central location, where it can be saved “as is,” without the need to impose a schema or structure on it up front. Data in all stages of the refinement process can be stored in a data lake: raw data can be ingested and stored right alongside an organization’s structured, tabular data sources (like database tables), as well as intermediate data tables generated in the process of refining raw data. Unlike most databases, data lakes can process all data types including images, video, audio and text.
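
The idea of saving data "as is" and imposing structure only when it is read can be sketched in a few lines of Python. This is a minimal illustration, not a production design: a local temporary directory stands in for object storage, and the `land` and `read_events` helpers, the file names, and the event fields are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land(lake: Path, name: str, payload: bytes) -> Path:
    """Store a payload in the lake exactly as received, with no schema check."""
    target = lake / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_events(path: Path) -> list[dict]:
    """Apply a schema only at read time: keep and type the fields this reader needs."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        rows.append({
            "user": str(record["user"]),
            "amount": float(record.get("amount", 0)),  # missing amounts default to 0.0
        })
    return rows


# Assumption: a temporary directory stands in for an object storage bucket.
lake = Path(tempfile.mkdtemp())

# Semi-structured and binary data land side by side, unchanged.
events = land(lake, "events.jsonl", b'{"user": "a", "amount": "9.5"}\n{"user": "b"}\n')
image = land(lake, "photo.bin", bytes([0x89, 0x50, 0x4E, 0x47]))

print(read_events(events))
print(image.read_bytes())
```

Note that the second event has no `amount` field and the first stores it as a string; neither is rejected at write time, and the reader decides how to interpret them.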

Why do you need a data lake?

Today, companies have lots of data, but it’s often siloed in different storage systems across the enterprise, such as data warehouses and databases. A data lake breaks down these data silos, centralizing and consolidating all of your organization’s batch and streaming data assets into a complete and authoritative data store for analytics that is always up to date. Unifying all of your data in a data lake is the first step for companies that aspire to harness the power of machine learning and data analytics to win in the next decade.

A data lake’s flexible, unified architecture opens up a wide range of new use cases for cross-functional, enterprise-scale analytics, BI, and machine learning projects that can unlock massive business value. Data analysts can harvest rich insights by querying the data lake using SQL, data scientists can join and enrich data sets to generate ML models with ever greater accuracy, data engineers can build automated ETL pipelines, and business intelligence analysts can create visual dashboards and reporting tools more quickly and easily than before. These use cases can all be performed on the data lake simultaneously, without lifting and shifting the data, even while new data is streaming in.

When properly architected, data lakes make it possible to:

Power data science and machine learning.

Data lakes allow you to transform raw data into structured data that is ready for SQL analytics, data science, and machine learning with low latency. Raw data can be retained indefinitely at low cost for future use in machine learning and analytics.
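
As a rough sketch of turning raw data into structured data that is ready for SQL, the Python below refines a few raw JSON lines into typed rows and queries them. The records, the `refine` and `load` helpers, and the `page_views` table are hypothetical, and an in-memory SQLite database is only a stand-in for whatever SQL engine actually runs over the lake.

```python
import json
import sqlite3

# Hypothetical raw clickstream records, kept exactly as they arrived.
RAW_LINES = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/pricing", "ms": 340}',
    'not valid json',
    '{"user": "a", "page": "/pricing", "ms": 200}',
]


def refine(raw_lines):
    """Turn raw lines into clean (user_id, page, ms) rows, skipping unparseable ones."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line is untouched; it is only left out of the refined rows
        rows.append((record["user"], record["page"], int(record["ms"])))
    return rows


def load(rows):
    """Load refined rows into an in-memory SQLite table (a stand-in for a SQL engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, ms INTEGER)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
    return conn


conn = load(refine(RAW_LINES))
per_page = conn.execute(
    "SELECT page, COUNT(*), AVG(ms) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(per_page)
```

Because `RAW_LINES` is never modified, the refinement logic can be changed and rerun later against the same raw input.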

Centralize, consolidate, and catalogue your data.

A centralized data lake eliminates problems with data silos (like data duplication, multiple security policies, and difficulty with collaboration), offering downstream users a single place to look for all sources of data.

Quickly and seamlessly integrate diverse data sources and formats.

Any and all data types can be collected and retained indefinitely in a data lake, including batch and streaming data, video, image, and binary files, and more. And since the data lake provides a landing zone for new data, it is always up to date.

Democratize your data by offering users self-service tools.

Data lakes are incredibly flexible, enabling users with completely different skills, tools, and languages to perform different analytics tasks all at once.

Data lakes vs. Data warehouses

General Characteristics

 

Primary types of data
    Data lake: all types, including structured, semi-structured, and unstructured (raw) data
    Data warehouse: structured data only

Cost
    Data lake: $
    Data warehouse: $$$

Scalability
    Data lake: scales to hold any amount of data at low cost, regardless of type
    Data warehouse: scaling up becomes exponentially more expensive due to vendor costs

Intended users
    Data lake: data analysts, data scientists
    Data warehouse: data analysts

Vendor lock-in
    Data lake: no
    Data warehouse: yes

Advantages
    Data lake: low cost, flexibility, scalability, allows storage of the raw data needed for machine learning
    Data warehouse: user interface is familiar to users of traditional databases

Disadvantages
    Data lake: exploring large amounts of raw data can be difficult without tools to organize and catalog the data
    Data warehouse: expensive, always-on architecture, proprietary software, cannot hold unstructured (raw) data needed for machine learning

History and evolution of data lakes

As the fields of big data analytics and data science have evolved, so too have the data architectures that supported them. In the modern era, the data lake has emerged as an attractive data architecture for companies looking to collect and retain the raw data needed for next-generation data analytics, business intelligence, and machine learning.

Data lake challenges

Data reliability

Without the proper tools in place, data lakes can suffer from data reliability issues that make it difficult for data scientists and analysts to reason about the data. These issues can stem from difficulty combining batch and streaming data, data corruption, and other factors.

Query performance

As the size of the data in a data lake increases, traditional query engines tend to slow down. Bottlenecks include metadata management and improper data partitioning, among others.
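
To illustrate why partitioning matters for query performance, here is a small Python sketch of a directory-per-date layout in which a query for one date opens only that date's files. The layout, the `write_partitioned` and `total_for_date` helpers, and the sample rows are assumptions made for illustration; real query engines apply the same idea with far more sophistication.

```python
import tempfile
from pathlib import Path


def write_partitioned(root: Path, rows):
    """Append rows as CSV lines under one directory per event date."""
    for date, user, amount in rows:
        part = root / f"date={date}"
        part.mkdir(parents=True, exist_ok=True)
        with open(part / "part-0.csv", "a", encoding="utf-8") as f:
            f.write(f"{user},{amount}\n")


def total_for_date(root: Path, date: str):
    """Sum amounts for one date, opening only the files in that date's partition."""
    files_read = 0
    total = 0.0
    for path in sorted((root / f"date={date}").glob("*.csv")):
        files_read += 1
        for line in path.read_text(encoding="utf-8").splitlines():
            total += float(line.split(",")[1])
    return total, files_read


# Assumption: a temporary directory stands in for the lake's storage.
root = Path(tempfile.mkdtemp())
write_partitioned(root, [
    ("2024-01-01", "a", 10.0),
    ("2024-01-01", "b", 5.0),
    ("2024-01-02", "a", 7.5),
    ("2024-01-03", "c", 1.0),
])

total, files_read = total_for_date(root, "2024-01-01")
all_files = len(list(root.rglob("*.csv")))
print(total, files_read, all_files)
```

With data partitioned by the column that queries filter on, the work done scales with the size of the matching partition rather than the size of the whole lake. Partitioning on a column that queries rarely filter on gives no such benefit.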

Data lake best practices

Use the data lake as a landing zone for all of your data

Save all of your data into your data lake without transforming or aggregating it, so that it is preserved for machine learning and data lineage purposes.

Secure your data lake with role- and view-based access controls

Adding view-based ACLs (access control lists) gives you more precise control over the security of your data lake than role-based controls alone.
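
The difference between role-based and view-based controls can be sketched in plain Python. Everything here is hypothetical (the table, the role names, the `eu_spend` view, and both helper functions) and is not the API of any particular product; it only shows how a view can expose a subset of columns and rows to a role that is not allowed to read the full table.

```python
# Hypothetical table; none of these names come from a real product.
CUSTOMERS = [
    {"id": 1, "region": "EU", "email": "a@example.com", "spend": 120},
    {"id": 2, "region": "US", "email": "b@example.com", "spend": 80},
]

# Role-based control: a role either can or cannot read the whole table.
ROLE_CAN_READ_TABLE = {"admin": True, "analyst": False}

# View-based control: each view exposes only some columns and some rows.
VIEWS = {
    "eu_spend": {
        "columns": ["id", "region", "spend"],  # email is never exposed
        "row_filter": lambda row: row["region"] == "EU",
        "roles": {"admin", "analyst"},
    },
}


def read_table(role):
    """All-or-nothing access to the full table."""
    if not ROLE_CAN_READ_TABLE.get(role, False):
        raise PermissionError(f"role {role!r} may not read the full table")
    return CUSTOMERS


def read_view(role, view_name):
    """Access restricted to the view's columns and rows."""
    view = VIEWS[view_name]
    if role not in view["roles"]:
        raise PermissionError(f"role {role!r} may not read view {view_name!r}")
    return [
        {column: row[column] for column in view["columns"]}
        for row in CUSTOMERS
        if view["row_filter"](row)
    ]


try:
    read_table("analyst")
    full_table_denied = False
except PermissionError:
    full_table_denied = True

print(full_table_denied)
print(read_view("analyst", "eu_spend"))
```

With role-based control alone, the analyst role would need either full access or none. The view gives it exactly the slice it needs.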

Build reliability and performance into your data lake by using Delta Lake

Until now, the nature of big data has made it difficult to offer the same level of reliability and performance that databases provide. Delta Lake brings these important features to data lakes.

Catalog the data in your data lake

Use data catalog and metadata management tools at the point of ingestion to enable self-service data science and analytics.
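
One way to picture cataloging at the point of ingestion is to make the write and the metadata record a single step. The sketch below is only an illustration under stated assumptions: the `ingest` and `find` helpers, the catalog fields, and the sample files are hypothetical, a Python list stands in for a real catalog service, and a temporary directory stands in for the lake.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake: Path, catalog: list, name: str, payload: bytes, source: str, tags: list):
    """Write a file into the lake and record its catalog entry in the same step."""
    path = lake / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    entry = {
        "path": path.relative_to(lake).as_posix(),
        "format": path.suffix.lstrip(".") or "unknown",
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "source": source,
        "tags": tags,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    catalog.append(entry)
    return entry


def find(catalog: list, tag: str):
    """Self-service lookup: list the paths of every dataset carrying a tag."""
    return [entry["path"] for entry in catalog if tag in entry["tags"]]


# Assumptions: a temp directory is the lake, and a list is the catalog.
lake = Path(tempfile.mkdtemp())
catalog = []
ingest(lake, catalog, "sales/orders.csv", b"id,total\n1,9.50\n",
       source="orders-db", tags=["sales", "batch"])
ingest(lake, catalog, "media/intro.mp4", b"\x00\x00\x00\x18",
       source="marketing", tags=["video"])

print(find(catalog, "sales"))
```

Because every file gets an entry as it lands, downstream users can search the catalog instead of browsing storage paths.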

Data lake tools and frameworks


Apache Spark™ is the de facto open source big data processing engine, enabling SQL queries and rapid distributed processing of the data in your data lake.

The Databricks Unified Data Analytics Platform makes it easy to run SQL queries on your data lake and to do massive-scale data engineering and collaborative data science.

Simplify and strengthen your data architecture by using Delta Lake to ensure data validity and consistent views at petabyte scale.

Amazon Web Services’ Simple Storage Service (S3) provides cost-effective object storage for data lakes.

Presto was originally created by Facebook to run queries on Hadoop data warehouses. It can be used to run SQL queries on data lakes at scale.

Customer Stories

Comcast’s Journey to Building an Agile Data and AI Platform at Scale

The Databricks Unified Data Analytics Platform enables Comcast to build rich datasets at a massive scale, optimize machine learning at scale, streamline workflows across teams, foster collaboration, reduce infrastructure complexity, and deliver superior customer experiences.

“If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

James Dixon

Databricks can help you build a reliable data lake for all your analytics needs, including data science, machine learning, and business intelligence.
