Data Lake

A data lake is a central repository that holds large volumes of highly diverse data in its native, raw format. Unlike a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture to store data.

What type of data is supported in Data Lakes?

A data lake holds big data from many sources in a raw, granular format. It can store structured, semi-structured, or unstructured data, so data can be kept in a flexible format and transformed only when it is ready to be used.
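The idea of keeping data raw and transforming it only at read time can be sketched in a few lines of Python. This is a minimal illustration, not how any particular data lake product works: the `raw_store` list, the `format` labels, and the `read_amounts` helper are all hypothetical names invented for this example.

```python
import csv
import io
import json

# Hypothetical in-memory stand-in for a lake: payloads are kept exactly
# as they arrived, and nothing is parsed at write time.
raw_store = [
    {"format": "json", "payload": '{"user": "ana", "amount": 12.5}'},
    {"format": "csv", "payload": "user,amount\nben,7.25\n"},
    {"format": "text", "payload": "free-form note with no fixed schema"},
]


def read_amounts(store):
    """Apply a (user, amount) schema at read time, skipping payloads that do not fit."""
    rows = []
    for item in store:
        if item["format"] == "json":
            record = json.loads(item["payload"])
            rows.append((record["user"], float(record["amount"])))
        elif item["format"] == "csv":
            for record in csv.DictReader(io.StringIO(item["payload"])):
                rows.append((record["user"], float(record["amount"])))
        # Unstructured text is left in the store untouched for other uses.
    return rows


amounts = read_amounts(raw_store)
```

The raw payloads are never modified, so a different question can later apply a different schema to the same stored data.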

What are the benefits of a Data Lake?

Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question. Many types of analytics can be applied to the data to uncover insights, including SQL queries, big data analytics, full-text search, real-time analytics, and machine learning. Data lakes are usually configured on a cluster of scalable commodity hardware, which can run on-premises or in the cloud. As a result, data can be stored in the lake in case it is needed at a future date, with little concern about storage capacity. The term data lake is usually associated with Hadoop-oriented object storage.
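The identifier-and-tag scheme described above can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `TinyLake` class, its `put` and `find` methods, and the tag names are hypothetical and do not correspond to any real data lake API.

```python
import uuid


class TinyLake:
    """Hypothetical flat store: every object gets a unique ID and metadata tags."""

    def __init__(self):
        self.objects = {}   # flat namespace: object ID -> raw bytes
        self.metadata = {}  # object ID -> dict of metadata tags

    def put(self, data, **tags):
        """Store raw bytes unchanged and return the generated identifier."""
        object_id = str(uuid.uuid4())
        self.objects[object_id] = data
        self.metadata[object_id] = dict(tags)
        return object_id

    def find(self, **wanted):
        """Return IDs of objects whose tags match every requested key/value."""
        return [
            object_id
            for object_id, tags in self.metadata.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]


lake = TinyLake()
log_id = lake.put(b"GET /index.html 200", source="web", kind="log")
lake.put(b'{"temp": 21.5}', source="sensor", kind="json")

# Narrow the lake down to the objects relevant to one question.
web_ids = lake.find(source="web")
```

Queries run against the metadata rather than the raw contents, which is why objects stored without useful tags become hard to find later.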

What are Hadoop Data Lakes?

A Hadoop data lake is a data management platform that stores data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes. Its main use is to process and store nonrelational data such as log files, sensor data, JSON objects, social posts, and images, as well as transactional data pulled from relational databases, to support analytics applications. Hadoop is compelling for organizations because it provides a low-cost approach to on-premises data storage, and it has proven to work well even for very large organizations. A word of caution, however: with a weak architecture and loose data governance policies, a Hadoop data lake can easily become a data dumping ground, turning the data lake into a data swamp. Data swamps have no curation: there is little to no active management throughout the data life cycle and little to no contextual metadata. As a result, the data in a swamp is frustrating to work with and of little or no use.
