Data lake

Glossary Item
Source Databricks

A data lake is a central location that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data.

Unlike a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture to store the data.

Data Lakes Support All Data Types

A data lake holds big data from many sources in a raw, granular format. It can store structured, semi-structured, or unstructured data, which means data can be kept in a more flexible format so we can transform it when we’re ready to use it.
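The idea of keeping data raw and transforming it only when it is needed is often called schema-on-read. The following is a minimal sketch of that idea, not a description of any particular product: the sample records, the `read_clicks` helper, and the field names are all hypothetical and chosen only for illustration.

```python
import json

# Raw records kept exactly as they arrived (hypothetical sample data).
raw_records = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": "5", "referrer": "search"}',
    'not json at all',
]


def read_clicks(raw):
    """Apply a schema at read time: parse each raw line, skip what does not fit."""
    rows = []
    for line in raw:
        try:
            record = json.loads(line)
            rows.append({"user": str(record["user"]), "clicks": int(record["clicks"])})
        except (ValueError, KeyError, TypeError):
            # The raw line stays in the lake untouched; it is only ignored by this reader.
            continue
    return rows


rows = read_clicks(raw_records)
total_clicks = sum(row["clicks"] for row in rows)
```

Nothing is rejected or reshaped at write time. A different question later on could read the same raw lines with a different schema, for example one that also uses the `referrer` field.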

Benefits of a Data Lake

Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. Whenever a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
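The combination of a unique identifier, metadata tags, and tag-based queries can be sketched in a few lines. This is a toy in-memory illustration under stated assumptions, not how any real data lake is implemented: the `TinyLake` class, its methods, and the tag names are hypothetical.

```python
import uuid


class TinyLake:
    """Toy flat store: every object sits at the same level, keyed by a unique ID."""

    def __init__(self):
        self._objects = {}  # object ID -> (raw payload, metadata tags)

    def put(self, payload, **tags):
        """Store a raw payload with its tags and return the generated ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (payload, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return the IDs of all objects whose tags match every requested value."""
        return [
            object_id
            for object_id, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, object_id):
        """Return the raw payload stored under the given ID."""
        return self._objects[object_id][0]


lake = TinyLake()
log_id = lake.put(b"GET /index.html 200", source="web", kind="log")
img_id = lake.put(b"\x89PNG...", source="camera", kind="image")
evt_id = lake.put(b'{"temp": 21.5}', source="sensor", kind="json")

# A business question narrows the lake down to a smaller, relevant set.
web_logs = lake.query(source="web", kind="log")
```

Here `web_logs` holds only the ID of the web log entry, which can then be fetched with `lake.get` and analyzed. There is no folder hierarchy; the tags alone decide what a query returns.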

You can apply various types of analytics to your data, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, to uncover insights.

Data lakes are usually configured on a cluster of scalable commodity hardware. As a result, data can be dumped in the lake in case it is needed at a future date, without worrying about storage capacity. The clusters can exist on-premises or in the cloud.

The term data lake is usually associated with Hadoop-oriented object storage.

Hadoop Data Lakes

Hadoop is a compelling choice for data systems because it provides a low-cost approach to data storage. It has proven to work well even for very large organizations.

A Hadoop data lake is a data management platform that stores data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes.

It is mainly used to process and store nonrelational data. The types of data it can process include log files, internet clickstream records, sensor data, JSON objects, images, and social media posts.
