History and evolution of data lakes
The early days of data management: databases
In the early days of data management, the relational database was the primary method that companies used to collect, store and analyze data. Relational databases, also known as relational database management systems (RDBMSes), offered a way for companies to store and analyze highly structured data about their customers using Structured Query Language (SQL). For many years, relational databases were sufficient for companies’ needs: the amount of data that needed to be stored was relatively small, and relational databases were simple and reliable. To this day, a relational database is still an excellent choice for storing highly structured data that’s not too big. However, the speed and scale of data were about to explode.
The rise of the internet, and data silos
With the rise of the internet, companies found themselves awash in customer data. To store all this data, a single database was no longer sufficient. Companies often built multiple databases organized by line of business to hold the data instead. As the volume of data continued to grow, companies often ended up with dozens of disconnected databases with different users and purposes.
On the one hand, this was a blessing: with more and better data, companies were able to target customers and manage their operations more precisely than ever before. On the other hand, this led to data silos: decentralized, fragmented stores of data across the organization. Without a way to centralize and integrate their data, many companies failed to turn it into actionable insights. This pain led to the rise of the data warehouse.
Data warehouses are born to unite companies’ structured data under one roof
With so much data stored in different source systems, companies needed a way to integrate them. A “360-degree view of the customer” became the goal of the day, and data warehouses were born to meet this need and unite disparate databases across the organization.
Data warehouses emerged as a technology that brings together an organization’s collection of relational databases under a single umbrella, allowing the data to be queried and viewed as a whole. At first, data warehouses were typically run on expensive, on-premises appliance-based hardware from vendors like Teradata and Vertica, and later became available in the cloud. Data warehouses became the dominant data architecture for big companies beginning in the late 1990s. The primary advantages of this technology included:
- Integration of many data sources
- Data optimized for read access
- Ability to run quick ad hoc analytical queries
- Data audit, governance and lineage
Data warehouses served their purpose well, but over time, the downsides to this technology became apparent:
- Inability to store unstructured, raw data
- Expensive, proprietary hardware and software
- Difficulty scaling due to the tight coupling of storage and compute power
Apache Hadoop™ and Spark™ enable unstructured data analysis, and set the stage for modern data lakes
With the rise of “big data” in the early 2000s, companies found that they needed to do analytics on data sets that could not conceivably fit on a single computer. Furthermore, the type of data they needed to analyze was not always neatly structured — companies needed ways to make use of unstructured data as well. To make big data analytics possible, and to address concerns about the cost and vendor lock-in of data warehouses, Apache Hadoop™ emerged as an open source distributed data processing technology.
What is Hadoop?
Apache Hadoop™ is a collection of open source software for big data analytics that allows large data sets to be processed with clusters of computers working in parallel. It includes Hadoop MapReduce, the Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator). HDFS allows a single data set to be stored across many different storage devices as if it were a single file. It works hand-in-hand with the MapReduce programming model, which splits a large computational task (like a statistical count or aggregation) into much smaller tasks that can be run in parallel on a computing cluster, then combines their partial results into a final answer.
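The idea of splitting one large count into many small parallel tasks can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce phases, not Hadoop's actual API: the function names (map_chunk, shuffle, reduce_groups, word_count) and the chunk_size parameter are hypothetical, and a local thread pool stands in for the nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map phase: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle phase: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_groups(groups):
    """Reduce phase: combine each key's values into one total."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines, chunk_size=2):
    # Split the input into chunks, standing in for blocks of a distributed file.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Each chunk is mapped independently; threads stand in for cluster nodes.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data big clusters", "data lakes store raw data", "big lakes"]
    print(word_count(sample))
```

Because each chunk is mapped without reference to any other chunk, the map phase can be spread across as many workers as there are chunks. Only the shuffle step needs to see the output of every worker, which is why it tends to be the expensive part of a real distributed job.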
The introduction of Hadoop was a watershed moment for big data analytics for two main reasons. First, it meant that some companies could conceivably shift away from expensive, proprietary data warehouse software to in-house computing clusters running free and open source Hadoop. Second, it allowed companies to analyze massive amounts of unstructured data in a way that was not possible before. Prior to Hadoop, companies with data warehouses could typically analyze only highly structured data, but now they could extract value from a much larger pool of data that included semi-structured and unstructured data. Once companies had the capability to analyze raw data, collecting and storing this data became increasingly important — setting the stage for the modern data lake.
Early data lakes were built on Hadoop
Early data lakes built on Hadoop MapReduce and HDFS enjoyed varying degrees of success. Many of these early data lakes used Apache Hive™ to enable users to query their data with a Hadoop-oriented SQL engine. Some succeeded, while others failed due to Hadoop’s complexity and other factors. To this day, many people still associate the term “data lake” with Hadoop because it was the first framework to enable the collection and analysis of massive amounts of unstructured data. Over time, however, Hadoop’s popularity leveled off, as it had problems that most organizations could not overcome, such as slow performance, limited security and lack of support for important use cases like streaming. Today, many modern data lake architectures have shifted from on-premises Hadoop to running Spark in the cloud. Still, these initial attempts were important: Hadoop data lakes were the precursors of the modern data lake.
Apache Spark: Unified analytics engine powering modern data lakes
Apache Spark followed a few years after Hadoop. Spark took the idea of MapReduce a step further, providing a powerful, generalized framework for distributed computations on big data. Over time, Spark became increasingly popular among data practitioners, largely because it was easy to use, performed well on benchmark tests, and provided additional functionality that increased its utility and broadened its appeal. For example, Spark’s interactive mode enabled data scientists to perform exploratory data analysis on huge data sets without having to spend time on low-value work like writing complex code to transform the data into a reliable source. Spark also made it possible to train machine learning models at scale, query big data sets using SQL, and rapidly process real-time data with Spark Streaming, significantly increasing the number of users and potential applications of the technology.
Since its introduction, Spark’s popularity has continued to grow, and it has become the de facto standard for big data processing, in no small part due to a committed base of community members and dedicated open source contributors. Today, many modern data lake architectures use Spark as the processing engine that enables data engineers and data scientists to perform ETL, refine their data, and train machine learning models.