Enabling the Customer Data Platform with Databricks ETL Support

Tackle large and complex data processing challenges with the Databricks platform to surface actionable insights to your Marketing team

Published: April 11, 2023

by Justin Fenton, Brian Shumsky, Jason Meketa and Bryan Smith

Customer Data Platforms (CDPs) play an increasingly important role in the enterprise marketing landscape. By bringing together data from a wide variety of internal and external sources to construct a 360-degree view of the customer around a shared understanding of customer identity, the CDP enables marketers to develop rich insights to drive targeted engagement.

More narrowly focused than general-purpose data platforms, the CDP provides native support for the ingestion of frequently employed data sources and for common transformations that turn raw data into informational assets ready for consumption by marketing teams. This built-in functionality helps accelerate time to value but may feel constraining when teams need to tackle more complex data transformations. This is when marketing teams may turn to their data engineers, and those data engineers turn to their preferred data processing platform, Databricks.

The Right Tool for the Right Job

Databricks has long been recognized for its ability to tackle large, complex data processing challenges. With its support for both structured and unstructured data sources, its high degree of extensibility and ease of integration with open source technologies, its blurring of the boundary between real-time and periodic batch processing, and its careful attention to workload management, Databricks has transformed how organizations think about analytic data processing.

This might appear to position Databricks as a rival to the CDP. Both systems support the processing of data, the generation of insights through analytics and the delivery of data and insights to downstream systems. But in our vision of a modern marketing ecosystem, we see the CDP and Databricks as complementary systems, each best fit for specific tasks, that when properly integrated can help organizations maximize the potential of their customer information assets and minimize costs.

Complex Data Abound

Returning to the idea of complex data processing challenges in the CDP landscape, consider the processing of product reviews, social media content, clickstream data, or airline bookings with nested arrays of values. All of these information sources originate from customers and can provide valuable insights the marketing team can leverage to drive better engagement. But the data volumes involved (sometimes billions or trillions of records) and the complexity of the data formats (e.g., XML, JSON or semi-structured text) are such that the data must be carefully digested before they become useful to marketers.

By flowing these data through Databricks, data engineers can bring the full power of the lakehouse platform to bear. Product feedback can be tagged for sentiment and tone, and topics can be extracted. Images can be interpreted and the products in view identified. Individual clicks can be condensed into summary information that captures the flow of a customer's recent visit to a website. And airline booking data in XML can be unpacked to neatly tie revenue to the multiple individuals on a reservation. This information can then flow from Databricks into the CDP, where marketers use these details to determine whom to engage and in what manner without having to wade through an ocean of raw data. The raw data remain preserved in the Databricks environment for analysts and data scientists who will have use for the data in its original, unaltered form.
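As a rough illustration of the airline booking case, the sketch below unpacks a small XML booking and produces one row per passenger with a share of the booking revenue. The element and attribute names, the sample document and the equal-split rule are all assumptions made for this example; they are not taken from any real booking schema or from the accompanying notebooks. It uses only the Python standard library so it stays self-contained, whereas at real volumes the same logic would more likely be expressed as a distributed job on Databricks.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical booking document; the element and attribute names are
# illustrative assumptions, not a real airline schema.
BOOKING_XML = """
<booking id="B100" total="100.00">
  <passengers>
    <passenger customer_id="C1"/>
    <passenger customer_id="C2"/>
    <passenger customer_id="C3"/>
  </passengers>
</booking>
"""


def revenue_per_passenger(xml_text):
    """Return one row per passenger with an equal share of the booking total.

    Amounts are kept as integer cents so the shares always add back up to
    the booking total. The equal split is an assumption for illustration.
    """
    root = ET.fromstring(xml_text.strip())
    passengers = root.findall("./passengers/passenger")
    if not passengers:
        return []

    total_cents = int(Decimal(root.get("total")) * 100)
    share, remainder = divmod(total_cents, len(passengers))

    rows = []
    for position, passenger in enumerate(passengers):
        # Hand out any leftover cents to the first passengers on the booking.
        cents = share + (1 if position < remainder else 0)
        rows.append(
            {
                "booking_id": root.get("id"),
                "customer_id": passenger.get("customer_id"),
                "revenue_cents": cents,
            }
        )
    return rows


if __name__ == "__main__":
    for row in revenue_per_passenger(BOOKING_XML):
        print(row)
```

The flat, per-customer rows are the kind of output a CDP can ingest directly, while the nested XML stays behind in the lakehouse.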

The Lakehouse Unlocks Insights for CDPs

To demonstrate how the Databricks lakehouse might assist a CDP with this kind of ETL offload, we partnered with our friends at Amperity around a scenario where customer data in the Amperity CDP is used to drive a targeted email campaign. The campaign is executed via the Salesforce Marketing Cloud (SFMC), where customer segments and individual consumer email addresses are pushed from Amperity to the SFMC platform. Scheduled jobs send messages to targeted individuals, and SFMC captures details about which emails were delivered, opened and clicked through, or otherwise bounced or triggered an unsubscribe request.

Details of these email message events, which can run into the billions of records in just a few weeks, are captured by SFMC and are made accessible to the marketer through a daily extract. Instead of feeding this high-volume data directly into Amperity, it's processed via Databricks, allowing for the capture of detailed information from ongoing email marketing campaigns while limiting the details flowing back. The customer 360 view housed in Amperity now has just those bits of information needed to understand the customer journey and define the next round of engagement.
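To make the idea of limiting the details flowing back more concrete, here is a minimal sketch that condenses raw email events into one summary record per customer. The field names, event labels and sample rows are hypothetical and are not the actual SFMC extract layout or the logic in the accompanying notebooks. It is written in plain Python to stay self-contained; against billions of records, the equivalent aggregation would more plausibly run as a distributed job on Databricks.

```python
from collections import Counter, defaultdict

# Hypothetical event rows as they might appear in a daily extract;
# the field names and event labels are assumptions for illustration.
EVENTS = [
    {"customer_id": "C1", "event": "delivered"},
    {"customer_id": "C1", "event": "opened"},
    {"customer_id": "C1", "event": "clicked"},
    {"customer_id": "C2", "event": "delivered"},
    {"customer_id": "C2", "event": "bounced"},
    {"customer_id": "C1", "event": "delivered"},
]

EVENT_TYPES = ("delivered", "opened", "clicked", "bounced", "unsubscribed")


def summarize_events(events):
    """Collapse raw email events into one summary record per customer."""
    counts = defaultdict(Counter)
    for row in events:
        counts[row["customer_id"]][row["event"]] += 1

    summary = {}
    for customer_id, counter in sorted(counts.items()):
        record = {event_type: counter[event_type] for event_type in EVENT_TYPES}
        delivered = record["delivered"]
        # Guard against division by zero when nothing was delivered.
        record["open_rate"] = record["opened"] / delivered if delivered else 0.0
        summary[customer_id] = record
    return summary


if __name__ == "__main__":
    for customer_id, record in summarize_events(EVENTS).items():
        print(customer_id, record)
```

Only these compact per-customer records would need to reach the CDP; the event-level detail can stay in the lakehouse for later analysis.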

To see this process in action, please check out the accompanying notebooks, where we capture the Databricks process along with the Salesforce and Amperity integrations that surround it. We hope this demonstration helps our customers envision their own ETL offload scenarios in which Databricks can help them achieve their customer engagement goals.
