Introduction to Data Lakes
Data lakes provide a complete and authoritative data store that can power data analytics, business intelligence, and machine learning.
To build a successful lakehouse, organizations have turned to Delta Lake, an open-format data management and governance layer that combines the best of both data lakes and data warehouses. Across industries, enterprises are using Delta Lake to power collaboration by providing a reliable, single source of truth. By delivering quality, reliability, security, and performance on the data lake for both streaming and batch operations, Delta Lake eliminates data silos and makes analytics accessible across the enterprise. With Delta Lake, customers can build a cost-efficient, highly scalable lakehouse that provides self-service analytics to end users.
| | Types of data | Cost | Format | Scalability | Intended users | Reliability | Ease of use | Performance |
|---|---|---|---|---|---|---|---|---|
| Data lake | All types: Structured data, semi-structured data, unstructured (raw) data | $ | Open format | Scales to hold any amount of data at low cost, regardless of type | Limited: Data scientists | Low quality, data swamp | Difficult: Exploring large amounts of raw data can be difficult without tools to organize and catalog the data | Poor |
| Data lakehouse | All types: Structured data, semi-structured data, unstructured (raw) data | $ | Open format | Scales to hold any amount of data at low cost, regardless of type | Unified: Data analysts, data scientists, machine learning engineers | High quality, reliable data | Simple: Provides simplicity and structure of a data warehouse with the broader use cases of a data lake | High |
| Data warehouse | Structured data only | $$$ | Closed, proprietary format | Scaling up becomes exponentially more expensive due to vendor costs | Limited: Data analysts | High quality, reliable data | Simple: Structure of a data warehouse enables users to quickly and easily access data for reporting and analytics | High |
Use the data lake as a landing zone for all of your data
Save all of your data into your data lake without transforming or aggregating it to preserve it for machine learning and data lineage purposes.
Mask data containing private information before it enters your data lake
Personally identifiable information (PII) must be pseudonymized to comply with GDPR and to ensure that it can be saved indefinitely.
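As one illustration of masking PII before it lands in the lake, the minimal Python sketch below replaces selected fields with a keyed hash (HMAC-SHA256). The field names, the `SECRET_KEY` value, and the `pseudonymize` and `mask_record` helpers are assumptions made for this example and are not part of any specific product. Whether a keyed hash meets a given compliance requirement depends on how the key is managed.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real key would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical set of fields treated as PII in this example.
PII_FIELDS = {"email", "phone"}


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash (HMAC-SHA256, hex-encoded) of a PII value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    """Replace PII fields with pseudonyms and leave all other fields untouched."""
    return {
        field: pseudonymize(str(value))
        if field in PII_FIELDS and value is not None
        else value
        for field, value in record.items()
    }


raw = {"user_id": 42, "email": "jane@example.com", "phone": "+1-555-0100", "country": "NL"}
masked = mask_record(raw)  # this is what would be written to the lake
```

Because the hash is keyed and deterministic, the same input always maps to the same pseudonym, so masked records can still be joined on the pseudonymized field.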
Secure your data lake with role- and view-based access controls
Adding view-based ACLs (access control lists) enables more precise tuning and control over the security of your data lake than role-based controls alone.
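The idea behind view-based control can be sketched with a small Python example that uses the standard-library `sqlite3` module. The table, the view name `customers_eu_analyst`, and the sample rows are hypothetical. SQLite has no `GRANT` statement, so this only shows how a view narrows the visible columns and rows. In an engine with permissions, access would be granted on the view and not on the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT, email TEXT, region TEXT);
    INSERT INTO customers VALUES
        (1, 'Ada', 'ada@example.com', 'EU'),
        (2, 'Grace', 'grace@example.com', 'US');

    -- View for a hypothetical EU analyst role: no email column, EU rows only.
    CREATE VIEW customers_eu_analyst AS
        SELECT id, name, region FROM customers WHERE region = 'EU';
    """
)

cursor = conn.execute("SELECT * FROM customers_eu_analyst")
columns = [col[0] for col in cursor.description]  # columns the view exposes
rows = cursor.fetchall()                          # rows the view exposes
```

A role-based rule alone would typically decide whether a user can read `customers` at all. The view adds a finer layer by fixing which columns and rows that user can see.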
Build reliability and performance into your data lake by using Delta Lake
Until now, the nature of big data has made it difficult to offer the same level of reliability and performance available with databases. Delta Lake brings these important features to data lakes.
Catalog the data in your data lake
Use data catalog and metadata management tools at the point of ingestion to enable self-service data science and analytics.
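To make "cataloging at the point of ingestion" concrete, here is a minimal Python sketch that records metadata for each CSV payload as it arrives and lets users search the catalog by column name. The in-memory `catalog` dictionary, the `ingest` and `search` helpers, and the dataset names are all assumptions for illustration. A real deployment would rely on a dedicated catalog or metadata service.

```python
import csv
import hashlib
import io
from datetime import datetime, timezone

catalog = {}  # hypothetical in-memory catalog: dataset name -> metadata entry


def ingest(name: str, raw_bytes: bytes, source: str, owner: str) -> dict:
    """Register metadata for a CSV payload at the moment it is ingested."""
    # Read only the header row to capture the column names.
    header = next(csv.reader(io.StringIO(raw_bytes.decode("utf-8"))))
    entry = {
        "name": name,
        "source": source,
        "owner": owner,
        "columns": header,
        "size_bytes": len(raw_bytes),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    catalog[name] = entry
    return entry


def search(column: str) -> list:
    """Return the names of datasets that contain the given column, sorted."""
    return sorted(n for n, e in catalog.items() if column in e["columns"])


orders_csv = b"order_id,customer_id,total\n1,7,19.99\n"
customers_csv = b"customer_id,region\n7,EU\n"
ingest("orders", orders_csv, source="shop-db", owner="sales")
ingest("customers", customers_csv, source="crm", owner="marketing")

matches = search("customer_id")  # datasets an analyst could join on customer_id
```

Capturing this metadata during ingestion, instead of afterwards, means every dataset is discoverable from the moment it reaches the lake.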
Shell has been undergoing a digital transformation as part of our ambition to deliver more and cleaner energy solutions. As part of this, we have been investing heavily in our data lake architecture. Our ambition has been to enable our data teams to rapidly query our massive data sets in the simplest possible way. The ability to execute rapid queries on petabyte scale data sets using standard BI tools is a game changer for us.
—Dan Jeavons, GM Data Science, Shell