Five Reasons to Build your Modern Data Stack on the Lakehouse with Databricks, dbt Labs and Fivetran
The Modern Data Stack (MDS) appeared several years ago as cloud-based modern data platforms put analytics - and the tools that power it - in the hands of practitioners. Gone were the days of carefully sized Hadoop clusters running on-premises, replaced by data warehouses that could scale instantly and connect over standard SQL to a new generation of ETL and BI tools. The lakehouse pattern is the latest and perhaps most powerful pattern to emerge in the last few years. It unifies the simplicity and scalability of data warehouses with the openness and cost advantages of data lakes. Importantly, the lakehouse pattern is strictly additive - as a data practitioner, you get the best of both worlds. In this blog, we provide five reasons to build a modern data stack on the lakehouse and explain what makes dbt Cloud and Fivetran on Databricks an ideal MDS solution.
Benefits of the Modern Data Stack
The Modern Data Stack offers several advantages to businesses:
- Elastic and scalable: Legacy systems are inelastic and expensive to scale. The MDS is built on cloud technologies that enable instant elasticity and usage-based pricing.
- ELT, not ETL: With cloud-first technologies, ETL has evolved into ELT. Data transformations are executed in the data warehouse, benefitting from its scale and performance.
- SQL-centric: SQL is the lingua franca of analytics. The MDS enables analysts to own data pipelines instead of relying on centralized data teams with limited bandwidth. All tools that connect to the MDS speak SQL, simplifying integration.
- Focus on insights: The MDS enables data teams to focus on generating insights and knowledge, instead of toil that does not generate business value. For example, MDS users use managed connectors instead of building and maintaining their own in the face of changing APIs and source schemas.
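The ELT pattern from the list above can be sketched in a few lines. This is a minimal illustration, with SQLite standing in for the cloud warehouse and all table and column names invented for the example: raw data is landed first, and the transformation runs as SQL inside the engine, which is exactly how dbt models are executed.

```python
import sqlite3

# SQLite stands in for the cloud warehouse/lakehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)")

# Extract + Load: land the source data as-is, with no upfront transformation.
rows = [(1, 1999, "complete"), (2, 450, "cancelled"), (3, 3000, "complete")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform: derive an analytics-ready model in SQL, so the engine's scale
# and performance do the work (the approach dbt codifies).
conn.execute("""
    CREATE VIEW completed_revenue AS
    SELECT COUNT(*) AS orders, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    WHERE status = 'complete'
""")
print(conn.execute("SELECT orders, revenue FROM completed_revenue").fetchone())  # (2, 49.99)
```

Because the transformation is just SQL over already-loaded data, it can be version-controlled, tested and re-run without touching the extraction step.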
Data warehouses do not scale to ML & AI
While the MDS paradigm brings many benefits over traditional on-premise systems, building it on legacy data warehouses has a severe shortcoming: it does not work for ML and AI workloads.
Data warehouses were never designed for ML and AI; they are 40-year-old technology built for one use case: fast analytical and BI queries over large data sets. Data scientists use notebooks to explore data, write code in SQL as well as Python and other computational or scripting languages, run training and inference, and take models from experimentation to deployment, including for real-time use cases. Data warehouses simply lack the capabilities to do any of this, which means you have to buy, integrate, maintain and govern an expensive and disparate set of additional products.
Data warehouses don’t scale to the data needs of ML and AI practitioners. Data warehouses achieve fast query performance by storing data in a proprietary format. Setting aside the problem that you are locking yourself in with a vendor, there’s also the fact that data processing gets prohibitively expensive as it scales on data warehouses. Customers resort to copying only subsets of data to data warehouses. This is at odds with modern ML/AI, which benefits from training over all historical data.
Modern Data Stack on the Lakehouse with Databricks + dbt Cloud + Fivetran
Businesses recognize the strategic value of being data-driven, but only a few successfully deliver on their data strategy. The lakehouse has emerged as the new standard for the MDS that solves the above challenges. It helps businesses unlock many data use cases — from analytics, BI and data engineering to data science and machine learning.
Adoption of and investments in the lakehouse continue to grow. A recent Foundry report surveying 400+ IT leaders on the state of the data stack finds that two-thirds (66%) are using a data lakehouse, and 84% of those who aren’t are likely to consider doing so.
In this section, we tell you why the lakehouse makes the best foundation for your modern data stack and how to get started with your own MDS on the lakehouse with Databricks + dbt Cloud + Fivetran.
1. Unified and open
The Databricks Lakehouse Platform is built on the lakehouse paradigm, which supports all data types and all workloads on one platform, eliminating the data silos that traditionally separate data engineering, analytics, BI, data science and machine learning. It combines the best elements of data lakes and data warehouses, delivering the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes. Instead of copying and transforming data across multiple systems, analytics teams can eliminate that operational overhead with one platform where they access all the data and share a common tool stack with their data science counterparts. They also get a single security and governance model, eliminating data access issues for teams that need visibility into all the data assets available for analysis.
2. Built for ML and AI (including LLMs) from the ground up
Once the data pipelines that bring in new datasets are established, organizations want to move forward-looking use cases such as ML and AI onto their MDS. In fact, ChatGPT disrupted everything, with thousands of organizations making generative AI their single biggest technological shift (and boardroom priority). The need to sync data between different systems to bring organization-wide, high-quality data together has never been greater.
Databricks Lakehouse is designed to let any data persona get started with Large Language Models, including Dolly, Databricks’ open source instruction-following LLM, to build and use language models in MDS applications. This means the ML team and the analytics engineer can collaborate on the same data sets to easily and affordably apply AI to real-world problems, helping them make better decisions for the business.
Databricks’ foundational ML lifecycle capabilities such as automated cluster management, the feature store and collaborative notebooks have saved companies millions of dollars while delivering substantial productivity gains. For example:
- CONA Services, a Coca-Cola company, uses Databricks for the full ML lifecycle to optimize the supply chain for hundreds of thousands of stores, reaping $6M+ in savings.
- Amgen improves data science collaboration to accelerate drug discovery, saving $50M+ in operational costs.
- Via leverages machine learning to accurately forecast demand, reducing compute costs by 25% and saving R$3.9M due to increased productivity.
3. Streaming for business critical use cases
Organizations are collecting large amounts of system-generated data from sensors, the web and other sources, and this data provides enormous strategic value. It is very difficult and costly to process this data in a legacy data warehouse (especially for AI and ML workloads). This is where the lakehouse shines! The Databricks Lakehouse Platform is built on Spark Structured Streaming, Apache Spark’s scalable and fault-tolerant stream processing engine, to process streaming data at scale.
Staying true to the infrastructure consolidation theme of the lakehouse, data teams can run their streaming workloads on the same platform as their batch workloads. The growth of streaming workloads on Databricks is staggering, with the weekly number of streaming jobs growing from thousands to millions over a period of three years - a rate that is still accelerating. The Databricks Lakehouse Platform makes the transition from batch to real-time processing much simpler, lowering the cost of operations and improving the TCO of your MDS.
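The reason the same platform can serve batch and streaming is worth unpacking. This is a plain-Python illustration, not Spark code: Structured Streaming's micro-batch model treats an unbounded stream as a sequence of small batches that incrementally update a running aggregate, so the same aggregation logic expresses both a batch job and a streaming one.

```python
# Plain-Python illustration of the micro-batch model behind Spark Structured
# Streaming: each arriving micro-batch folds into stateful running aggregates.
from collections import defaultdict

running_counts = defaultdict(int)  # stateful aggregate, carried across batches

def process_micro_batch(events):
    """Fold one micro-batch of (sensor_id, reading) events into the state."""
    for sensor_id, _reading in events:
        running_counts[sensor_id] += 1

# Two micro-batches arriving over time; each one updates the same state,
# so results stay current without reprocessing historical data.
process_micro_batch([("s1", 20.1), ("s2", 19.7), ("s1", 20.3)])
process_micro_batch([("s2", 19.9), ("s1", 20.0)])
print(dict(running_counts))  # {'s1': 3, 's2': 2}
```

In Spark the engine handles the hard parts this toy omits: fault tolerance, exactly-once state checkpointing, and scaling the per-batch work across a cluster.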
4. Industry leading price-performance
Databricks has developed industry-leading data warehousing capabilities directly on data lakes, bringing the best of both worlds together in one data lakehouse architecture. Databricks SQL is a serverless data warehouse that lets you run all SQL and BI applications at scale with up to 12x better price/performance than traditional cloud data warehouses. Analytics teams can query, find and share insights with native connectors to the most popular BI tools like Tableau, Power BI and Looker, or use its built-in SQL editor, visualizations and dashboards. Check out demos and success stories here.
Databricks SQL includes Photon, the next-generation engine on the Databricks Lakehouse Platform, which provides extremely fast query performance at low cost: up to 3-8x faster interactive workloads, ETL at one-fifth the compute cost, and 30% average TCO savings.
Databricks has optimizations that speed up query performance and improve TCO so data teams can iterate and get to business value faster. It also automatically scales the system for more concurrency. The availability of serverless compute for Databricks SQL (DBSQL) enables every analyst and analytics engineer to ingest, transform and query the most complete and freshest data without having to worry about the underlying infrastructure.
5. A vibrant and growing ecosystem of partners
The Databricks Lakehouse Platform provides connectivity to a vast ecosystem of data and AI tools. This includes native product integrations with dbt Cloud and Fivetran for automated ELT solutions upstream of analytics and ML in the lakehouse.
Fivetran provides a secure, scalable, real-time data integration solution on the Databricks Lakehouse Platform. Its 300+ built-in connectors to databases, SaaS applications, events and files automatically integrate data in a normalized state into Delta Lake. Fivetran’s low-impact log-based change data capture (CDC) makes it easy to replicate on-prem and cloud databases in real time for fast, continuous data delivery in the lakehouse.
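Log-based CDC can be pictured as replaying an ordered stream of change events, read from the source database's transaction log, against a replica table. The sketch below is a hypothetical illustration of that apply step, with an invented event shape; it is not Fivetran's implementation.

```python
# Hypothetical sketch of applying log-based CDC events to a replica table
# keyed by primary key. The event format here is invented for illustration;
# it is not Fivetran's actual wire format or implementation.
def apply_cdc_event(table, event):
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        table[key] = row          # upsert the latest row image
    elif op == "delete":
        table.pop(key, None)      # remove the deleted row

replica = {}
log = [  # ordered change events, as read from the source database's log
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "plan": "free"}},
    {"op": "delete", "key": 2},
]
for event in log:
    apply_cdc_event(replica, event)
print(replica)  # {1: {'name': 'Ada', 'plan': 'pro'}}
```

Because only changed rows flow through, the source database does minimal extra work - which is why log-based CDC is described as low-impact compared with repeatedly re-querying full tables.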
With Databricks Partner Connect, analytics teams can connect instantly to dbt Cloud, and build production-grade data transformations directly on the lakehouse. Analytics engineers can simplify access to all their data, collaboratively explore, transform and query the freshest data in place on top of a unified, open and scalable lakehouse platform for all analytics and AI workloads.
Customer highlight
Condé Nast serves up multimedia content on a global scale
Like many large enterprises, Condé Nast stored its data in siloed systems. As it planned its global expansion, Condé Nast realized its data architecture was too complex to give the company the scalability it needed.
Condé Nast implemented dbt Cloud and Fivetran alongside Databricks Lakehouse to give all data teams access to the same data sets. The company now enables data warehouse engineers to build data models quickly for analytics, machine learning applications and reporting.
“With dbt Cloud and Databricks Lakehouse, our data scientists who build personalization models and churn models are finally using the same data sets that our marketers and analysts use for activation and business insights,” said Nana Essuman, Senior Director of Data Engineering & Data Warehouse, Condé Nast. “This has dramatically increased our productivity while decreasing dependency on data engineers. It’s also much easier to monitor and control the costs of our entire data infrastructure because it’s all running on one platform.”
Read the full customer story here. More information about the Fivetran workflow is here.
Unlock modern data workloads with Databricks, dbt Cloud and Fivetran
As seen here, the lakehouse serves as the best home for the modern data stack. Databricks, dbt Cloud and Fivetran simplify your modern data stack with a unified approach that eliminates the data silos that separate and complicate data engineering, analytics, BI, data science and machine learning.
Hear from the co-founders of Databricks, Fivetran and dbt Labs why the lakehouse is the right data architecture for all your data and AI use cases. Register now and get a $100 credit toward Databricks certifications.
Build your own modern data stack by integrating Fivetran and dbt Cloud with Databricks. We also have a demo project for Marketing Analytics Solution Using Fivetran and dbt on the Databricks Lakehouse. If you want to learn more about dbt Cloud on Databricks, try the dbt with Databricks step-by-step training.