A data lake is a centralized data repository that can store both traditional structured (row and column) data and unstructured, non-tabular raw data in its native format (such as videos, images, and binary files). Data lakes leverage inexpensive object storage and open formats so that many applications can take advantage of the data.
Data lakes are often used to consolidate all of an organization’s data in a single, central location, where it can be saved “as is,” without the need to impose a schema or structure on it up front. Data in all stages of the refinement process can be stored in a data lake: raw data can be ingested and stored right alongside an organization’s structured, tabular data sources (like database tables), as well as intermediate data tables generated in the process of refining raw data. Unlike most databases, data lakes can process all data types including images, video, audio and text.
Today, companies have lots of data, but it’s often siloed in separate storage systems across the enterprise, such as data warehouses and databases. A data lake breaks down these data silos, centralizing and consolidating all of your organization’s batch and streaming data assets into a complete and authoritative data store for analytics that is always up to date. Unifying all of your data in a data lake is the first step for companies that aspire to harness the power of machine learning and data analytics to win in the next decade.
A data lake’s flexible, unified architecture opens up a wide range of new use cases for cross-functional, enterprise-scale analytics, BI, and machine learning projects that can unlock massive business value. Data analysts can harvest rich insights by querying the data lake using SQL, data scientists can join and enrich data sets to generate ML models with ever greater accuracy, data engineers can build automated ETL pipelines, and business intelligence analysts can create visual dashboards and reporting tools faster and more easily than before. These use cases can all be performed on the data lake simultaneously, without lifting and shifting the data, even while new data is streaming in.
When properly architected, data lakes offer the following benefits:
Data lakes allow you to transform raw data into structured data that is ready for SQL analytics, data science, and machine learning with low latency. Raw data can be retained indefinitely at low cost for future use in machine learning and analytics.
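As a rough illustration of the raw-to-structured step described above, the following sketch parses raw JSON lines into rows and loads them into a table that can be queried with SQL. It uses only the Python standard library, with an in-memory SQLite database standing in for a real query engine. The sample events, the `refine` and `load_structured` functions, and the column names are assumptions made for this example.

```python
import json
import sqlite3

# Hypothetical raw events, kept exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"user_id": "ana", "event_type": "click", "duration_ms": 120}',
    '{"user_id": "ben", "event_type": "view"}',   # optional field missing
    'not valid json',                              # corrupt record
    '{"user_id": "ana", "event_type": "view", "duration_ms": 80}',
]


def refine(raw_lines):
    """Parse raw lines into (user_id, event_type, duration_ms) rows.

    Unparseable lines are skipped here; the raw copy is left untouched.
    """
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        rows.append(
            (record["user_id"], record["event_type"], record.get("duration_ms"))
        )
    return rows


def load_structured(rows):
    """Load refined rows into an in-memory SQLite table for SQL queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (user_id TEXT, event_type TEXT, duration_ms INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


conn = load_structured(refine(RAW_EVENTS))
views_per_user = conn.execute(
    "SELECT user_id, COUNT(*) FROM events "
    "WHERE event_type = 'view' GROUP BY user_id ORDER BY user_id"
).fetchall()
print(views_per_user)
```

Because `RAW_EVENTS` is never modified, the corrupt line remains available for later inspection or reprocessing, which mirrors the idea of retaining raw data alongside its refined form.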
A centralized data lake eliminates problems with data silos (like data duplication, multiple security policies, and difficulty with collaboration), offering downstream users a single place to look for all sources of data.
Data of any type can be collected and retained indefinitely in a data lake, including batch and streaming data as well as video, image, and binary files. And since the data lake provides a landing zone for new data, it is always up to date.
Data lakes are incredibly flexible, enabling users with completely different skills, tools, and languages to perform different analytics tasks all at once.
As the fields of big data analytics and data science have evolved, so too have the data architectures that supported them. In the modern era, the data lake has emerged as an attractive data architecture for companies looking to collect and retain the raw data needed for next-generation data analytics, business intelligence, and machine learning.
Without the proper tools in place, data lakes can suffer from data reliability issues that make it difficult for data scientists and analysts to reason about the data. These issues can stem from difficulty combining batch and streaming data, data corruption, and other factors.
As the size of the data in a data lake increases, the performance of traditional query engines tends to degrade. Bottlenecks include metadata management and improper data partitioning, among others.
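One common way to reduce the partitioning bottleneck is to lay files out by a column that queries filter on, so that a query opens only the matching directories. The sketch below illustrates the idea with plain CSV files and the Python standard library. The `date=<value>` directory layout, the function names, and the sample rows are assumptions for illustration, not requirements of any particular engine.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["date", "amount"]


def write_partitioned(root, rows):
    """Append each row to <root>/date=<value>/part-0.csv."""
    for row in rows:
        part_dir = Path(root) / f"date={row['date']}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if is_new:
                writer.writeheader()
            writer.writerow(row)


def total_for_date(root, date):
    """Sum amounts for one date, opening only that date's partition."""
    total = 0
    files_read = 0
    for path in sorted((Path(root) / f"date={date}").glob("*.csv")):
        files_read += 1
        with path.open(newline="") as f:
            total += sum(int(r["amount"]) for r in csv.DictReader(f))
    return total, files_read


# Hypothetical sample rows spanning three dates.
ROWS = [
    {"date": "2024-01-01", "amount": 10},
    {"date": "2024-01-01", "amount": 5},
    {"date": "2024-01-02", "amount": 7},
    {"date": "2024-01-03", "amount": 3},
]

with tempfile.TemporaryDirectory() as lake_root:
    write_partitioned(lake_root, ROWS)
    total, files_read = total_for_date(lake_root, "2024-01-01")
    partitions = sorted(p.name for p in Path(lake_root).iterdir())

print(total, files_read, partitions)
```

The query for a single date touches one file out of three. With a poorly chosen partition column, or none at all, the same query would have to scan every file.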
Save all of your data into your data lake in its raw form, without transforming or aggregating it, so that it is preserved for machine learning and data lineage purposes.
Adding view-based ACLs (access control lists) enables more precise tuning and control over the security of your data lake than role-based controls alone.
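To make the idea of view-based access control concrete, here is a minimal sketch in which a view exposes only some rows and columns of a table, and each role is allowed to query only certain relations. SQLite has no `GRANT` statement, so the permission check is done in application code. The `ACL` mapping, the role names, the `customers` table, and the `query` helper are all hypothetical and are not taken from any specific data lake product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "EU", "a@example.com"),
        (2, "US", "b@example.com"),
        (3, "EU", "c@example.com"),
    ],
)

# The view hides the email column and restricts rows to one region.
conn.execute(
    "CREATE VIEW customers_eu_masked AS "
    "SELECT id, region FROM customers WHERE region = 'EU'"
)

# Hypothetical ACL: the relations each role is allowed to read.
ACL = {
    "admin": {"customers", "customers_eu_masked"},
    "eu_analyst": {"customers_eu_masked"},
}


def query(role, relation):
    """Return all rows of a relation if the role's ACL entry permits it."""
    if relation not in ACL.get(role, set()):
        raise PermissionError(f"{role!r} may not read {relation!r}")
    # The relation name was checked against the allowlist above,
    # so it is safe to interpolate it here.
    return conn.execute(f"SELECT * FROM {relation} ORDER BY id").fetchall()


print(query("eu_analyst", "customers_eu_masked"))
```

In this sketch the `eu_analyst` role can read the masked view but receives a `PermissionError` when asking for the underlying `customers` table, so the restriction applies at the level of rows and columns rather than whole datasets.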
Until now, the nature of big data has made it difficult to offer the same level of reliability and performance that databases provide. Delta Lake brings these important features to data lakes.
Use data catalog and metadata management tools at the point of ingestion to enable self-service data science and analytics.
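The following sketch shows one way cataloging at the point of ingestion might look: each file is stored unchanged, and a metadata entry is appended to a catalog at the same moment so that users can later find data by tag. It uses only the Python standard library. The `raw/` directory, the `catalog.jsonl` file, the entry fields, and the `ingest` and `search` functions are assumptions for illustration, not the interface of a real catalog tool.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake_root, name, payload, tags):
    """Store payload unchanged and append a catalog entry describing it."""
    root = Path(lake_root)
    raw_dir = root / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    (raw_dir / name).write_bytes(payload)

    entry = {
        "path": f"raw/{name}",
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "tags": sorted(tags),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    with (root / "catalog.jsonl").open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def search(lake_root, tag):
    """Return the paths of all cataloged files that carry the given tag."""
    with (Path(lake_root) / "catalog.jsonl").open() as f:
        return [e["path"] for e in map(json.loads, f) if tag in e["tags"]]


with tempfile.TemporaryDirectory() as lake_root:
    # Hypothetical files: one JSON event and one binary image header.
    ingest(lake_root, "clicks.json", b'{"user": "ana"}', {"web", "events"})
    ingest(lake_root, "logo.png", b"\x89PNG", {"image"})
    web_paths = search(lake_root, "web")

print(web_paths)
```

Recording the entry inside `ingest` itself, rather than in a separate later job, means no file can land in the lake without being discoverable.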