Collecting, storing, and processing customer event data involves unique technical challenges. It's high volume, noisy, and constantly changing. In the past, these challenges led many companies to rely on third-party, black-box SaaS solutions for managing their customer data. But this approach taught many companies a hard lesson: black boxes create more problems than they solve, including data silos, rigid data models, and poor integration with the additional tooling needed for analytics. The good news is that the pain from black-box solutions ushered in today's engineering-driven era, where companies prioritize centralizing data in a single, open storage layer at the center of their data stack.
Because of these characteristics of customer data, the flexibility of the data lakehouse makes it an ideal architecture for centralizing it. The lakehouse brings the critical data management features of a data warehouse together with the openness and scalability of a data lake, making it a natural storage and processing layer for your customer data stack. You can read more on how the data lakehouse enhances the customer data stack here.
Delta Lake is an open source project that serves as the foundation of a cost-effective, highly scalable lakehouse architecture. It's built on top of your existing data lake, whether that's Amazon S3, Google Cloud Storage, or Azure Blob Storage. This storage and management layer adds ACID transactions and schema enforcement to your data lake, bringing reliability to your data. Delta Lake eliminates data silos by providing a single home for all data types, making analytics simple and accessible across the enterprise and the data lifecycle.
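To make those reliability guarantees concrete, here is a minimal sketch, not RudderStack's or Databricks' code, of an ACID append to a Delta table on object storage. It assumes a Spark session with the delta-spark package configured, and the bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath (e.g. via --packages).
spark = (
    SparkSession.builder.appName("delta-lake-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame(
    [("u1", "page_view", "2024-01-01T00:00:00Z")],
    ["user_id", "event", "timestamp"],
)

# Each append is a single ACID transaction; readers never see partial files.
events.write.format("delta").mode("append").save("s3a://my-bucket/delta/events")

# Schema enforcement: a later write with an incompatible schema is rejected
# unless you explicitly opt in to schema evolution (e.g. mergeSchema).
```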
With RudderStack moving data into and out of your lakehouse, and Delta Lake serving as your centralized storage and processing layer, what you can do with your customer data is essentially limitless.
How do you take unstructured events and deliver them to your data lakehouse in the right format, like Delta? You could build a connector yourself, or you could use RudderStack's Databricks integration and save yourself the trouble. RudderStack's integration takes care of all the complex integration work:
Converting your events
As events come in, RudderStack converts them from JSON to a columnar format according to our predefined schema and groups them into size- and time-bound batches. These staging files are delivered to user-defined object storage.
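As an illustration only, and not RudderStack's actual implementation, the sketch below shows what converting one time-bound batch of JSON events into columnar staging files might look like with PySpark. The schema, paths, and partitioning choice are all hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("staging-batch").getOrCreate()

# A predefined schema applied to incoming events instead of schema inference.
event_schema = StructType([
    StructField("anonymousId", StringType()),
    StructField("userId", StringType()),
    StructField("event", StringType()),
    StructField("timestamp", TimestampType()),
])

# One size/time-bound batch of raw JSON events (hypothetical path).
batch = spark.read.schema(event_schema).json("s3a://my-bucket/raw/2024-01-01T00/")

# Columnar staging files, partitioned so later steps can regroup them cheaply.
batch.write.mode("append").partitionBy("event").parquet(
    "s3a://my-bucket/rudder-staging/2024-01-01T00/"
)
```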
Creating and delivering load files
Once the staging files are delivered, RudderStack regroups them by event name and loads them into their respective tables at a user-chosen frequency, anywhere from every 30 minutes up to every 24 hours. These "load files" are delivered to the same user-defined object storage.
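Here is a hedged sketch of the regrouping idea, again not RudderStack's code: staging data is split by event name so each event type ends up in its own set of load files, which will map to its own table. The paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("build-load-files").getOrCreate()

staging = spark.read.parquet("s3a://my-bucket/rudder-staging/2024-01-01T00/")

# One set of load files per event name; each maps to a table such as page_view.
event_names = [r["event"] for r in staging.select("event").distinct().collect()]
for event_name in event_names:
    (staging.filter(staging.event == event_name)
            .write.mode("append")
            .parquet(f"s3a://my-bucket/rudder-load/{event_name}/"))
```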
Loading data to Delta Lake
Once the load files are ready, our Databricks integration loads the data from the generated files into Delta Lake.
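For intuition, here is a minimal sketch of that final load with PySpark and Delta, assuming a Databricks or Delta-enabled Spark environment. The table and path names are hypothetical, and the real integration handles this for you.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-to-delta").getOrCreate()

event_name = "page_view"  # hypothetical event/table name
load_files = spark.read.parquet(f"s3a://my-bucket/rudder-load/{event_name}/")

# Append the batch to the event's Delta table as a single ACID transaction.
(load_files.write.format("delta")
           .mode("append")
           .saveAsTable(event_name))
```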
Handling schema changes
RudderStack handles schema changes automatically, such as creating required tables or adding new columns. While RudderStack does this for ease of use, it honors user-set schemas when loading the data. In the case of a data type mismatch, the data is still delivered so the user can backfill it after a cleanup.
If you want to get more value out of the customer event data in your data lakehouse, without worrying about building event ingestion infrastructure, you can sign up for RudderStack and test drive the Databricks integration today. Simply set up your data sources, configure Delta Lake as a destination, and start sending data.
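If you're curious what "start sending data" can look like in code, here is a small sketch that posts a track event to RudderStack's HTTP track endpoint from Python. The data plane URL and write key are placeholders for values from your own RudderStack workspace, and most teams would use one of RudderStack's SDKs instead.

```python
import requests

DATA_PLANE_URL = "https://your-data-plane.example.com"  # placeholder
WRITE_KEY = "YOUR_WRITE_KEY"                            # placeholder

event = {
    "userId": "user-123",
    "event": "Order Completed",
    "properties": {"revenue": 49.99, "currency": "USD"},
}

# The write key is typically sent as the Basic Auth username with no password.
resp = requests.post(
    f"{DATA_PLANE_URL}/v1/track",
    json=event,
    auth=(WRITE_KEY, ""),
    timeout=10,
)
resp.raise_for_status()
```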
Setting up the integration is straightforward: refer to RudderStack's documentation for a detailed, step-by-step guide on sending event data from RudderStack to Delta Lake.