Fueling growth with predictive models and improved customer experience
10x Faster release of new data products to the platform
2x Number of data sets brought online quarter-over-quarter
Explorium enables organizations to find the right data, build predictive models and make informed business decisions by integrating their data with the world’s most reliable sources. Its platform puts the world’s largest business data ecosystem at the fingertips of customers. Seeking to minimize data latency and free its data engineers from the task of building ELT pipelines, Explorium implemented the Databricks Data Intelligence Platform and dbt. Today, the company’s analysts and data product developers can build their own pipelines without having to know complex languages such as Python. They can also deploy dbt on any Databricks cluster to test ELT pipelines thoroughly before they go live. These capabilities have dramatically reduced the time it takes Explorium to release data products and bring new data to its platform and customers.
Integrating customers’ data with the world’s best sources
Explorium offers a powerful platform built on top of some of the world’s most reliable data sources, combining the platform and those sources into a single valuable product. Explorium’s customers rely on the platform to enrich their existing business data according to their specific needs. That’s why Explorium must ensure it can load the right data quickly, regardless of the technical challenges it faces on the back end.
“Our customers submit highly complex queries and expect our platform to serve up data they can’t get anywhere else,” explained Avshalom Chai, Technical Product Manager at Explorium. “We allow them to upload data sets of any size in any location. They want an easy, intuitive user experience and fast performance. To deliver all this and verify the quality and freshness of our data, we execute countless processes within our ELT function.”
Once a customer has uploaded a data set, the Explorium platform determines the characteristics of the data and identifies the potential enrichments it can make. For example, a data set that contains geographical locations might benefit from having traffic data added. Customers who upload massive data sets can see their results in a few hours, while customers who upload smaller data sets see immediate enrichment. In either case, Explorium must minimize data latency. The company originally ran its ELT pipelines on Amazon EMR but realized its data engineers were spending too much time building those pipelines.
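The story doesn’t detail how the enrichment engine works internally, but the core idea can be sketched in a few lines of PySpark. The table names, columns and join key below are hypothetical illustrations, not Explorium’s actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical customer upload: one row per business location.
customer_df = spark.table("uploads.customer_locations")

# Hypothetical curated source: traffic volume keyed by geohash.
traffic_df = spark.table("sources.traffic_by_geohash")

# Enrichment: attach traffic data to each location the customer uploaded.
enriched = customer_df.join(traffic_df, on="geohash", how="left")

enriched.write.mode("overwrite").saveAsTable("results.customer_locations_enriched")
```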
“As our data team grew, we wanted to let our data product developers and analysts build transformation logic using SQL without relying on our data engineers,” said Anton Peniaziev, Data Engineer and Tech Lead at Explorium. “Our data engineers needed to concentrate on building data products, not pipelines. We started looking for a new data architecture that would make this shift in focus possible.”
Automated workflow frees up engineers to build infrastructure
Explorium found its new architecture when it deployed the Databricks Data Intelligence Platform and dbt. Databricks offers auto-scaling and sophisticated libraries for Delta tables, automating tasks such as optimizing tables and checking file sizes. That automation saved time for the company’s engineers and let them concentrate on building infrastructure.
“Our data engineers used to build ELT pipelines by writing Spark jobs in Scala or PySpark,” said Peniaziev. “Even when we used Apache Airflow to orchestrate these jobs, it wasn’t automated enough to deliver value. By using Databricks Workflows in our infrastructure, we’ve automated our most complex jobs throughout the medallion architecture.”
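Workflows jobs can be defined in the UI or in code. As a rough illustration of a multi-task job that chains medallion stages, here is a minimal sketch using the Databricks SDK for Python; the job name, notebook paths and cluster ID are placeholders, not Explorium’s actual job definition:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads workspace credentials from the environment

# Two chained tasks: the silver transform runs only after bronze ingestion succeeds.
w.jobs.create(
    name="medallion-elt",  # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="bronze_ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/pipelines/bronze_ingest"),
            existing_cluster_id="<cluster-id>",
        ),
        jobs.Task(
            task_key="silver_transform",
            depends_on=[jobs.TaskDependency(task_key="bronze_ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/pipelines/silver_transform"),
            existing_cluster_id="<cluster-id>",
        ),
    ],
)
```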
To ensure the highest data quality for its customers, Explorium now follows the medallion architecture, a design pattern that organizes lakehouse data into layers of progressively higher quality: bronze, silver and gold. This multilayered approach can help companies build a single source of truth for their enterprise data products. To load raw data into the bronze layer, Explorium uses Databricks Auto Loader. To promote validated data into the silver layer, the company builds transformations in SQL. For enriched data that belongs in the gold layer, Explorium extracts data from Delta Lake tables and ingests it into a warehouse or database, from which the Explorium platform retrieves data directly and serves it up to customers.
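For the bronze step, an Auto Loader ingestion can be as small as the following PySpark sketch; the S3 paths and table name are hypothetical, and on Databricks the spark variable is the ambient SparkSession:

```python
# Incrementally ingest newly arrived raw files into a bronze Delta table.
bronze_stream = (
    spark.readStream.format("cloudFiles")  # "cloudFiles" is Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/events")
    .load("s3://example-bucket/raw/events")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/bronze_events")
    .trigger(availableNow=True)  # drain the current backlog, then stop
    .toTable("bronze.events")
)
```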
“With Databricks, we’ve built a simple automated workflow that has freed our data engineers from building ELTs,” remarked Peniaziev. “That task has moved to our data analysts and product developers, who don’t need to write Spark jobs or use Scala or Python — they can focus on configuring the ELTs, providing rich data sources and uploading data with Databricks Auto Loader.”
One of the keys to this automation is dbt. Explorium wanted to be sure its data analysts and product developers were thoroughly testing their pipelines before deploying them. dbt provides these testing capabilities while eliminating the need to involve data engineers for cluster definitions and sizes, permissions to connect to AWS resources, and other complex requirements.
“We didn’t want our users to use custom logic to enrich data or verify its integrity,” said Chai. “To keep everyone in the right context, we’re using user-defined functions in dbt. This takes away the need for analysts and product developers to know languages such as Python because the UDFs contain prebuilt chunks of logic that they can deploy as needed. Another great thing about dbt is that we can install it on any Databricks cluster and get right to work. The integration between these two solutions is amazing!”
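The story doesn’t show Explorium’s UDFs, but the pattern is straightforward to sketch: an engineer registers a chunk of logic once, and analysts call it from plain SQL inside their dbt models. The function below is a hypothetical example, not Explorium’s code:

```python
from pyspark.sql.types import BooleanType

def is_valid_us_zip(value: str) -> bool:
    """Prebuilt chunk of validation logic: a crude US ZIP code check."""
    return value is not None and len(value) == 5 and value.isdigit()

# Register the function so SQL running on this cluster can call it by name.
spark.udf.register("is_valid_us_zip", is_valid_us_zip, BooleanType())

# An analyst's dbt model can now use it without writing any Python, e.g.:
#   SELECT * FROM {{ ref('silver_companies') }} WHERE is_valid_us_zip(zip_code)
spark.sql("SELECT is_valid_us_zip('94105') AS ok").show()
```

From there, dbt’s built-in schema tests (such as unique and not_null) let the same analysts validate a model before promoting it, without pulling in a data engineer.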
Product development pipeline moves 10 times faster
Explorium strives for the highest levels of data quality and the fastest speed as it serves up data to its customers. Since going live on Databricks and dbt, the company has delivered new data products to its platform 10 times faster.
“A transformation like this not only helps us meet customer needs more quickly but also has a profound effect on the makeup of our company,” Peniaziev noted. “Instead of needing so many data engineers, we can now focus on hiring people who concentrate on building up domain and business knowledge. We set a goal of having our engineers stop building ELTs and start building infrastructure, and we’ve achieved it.”
In the quarter after going live, Explorium onboarded twice as many new data sets as in the previous quarter. The company’s jobs run more quickly on Databricks than they did on EMR. And Explorium’s data teams have found that with Databricks and dbt, developers need less context to create a new ELT or update an existing one. All of this adds up to a better experience for Explorium’s customers.
“We’ve been adding a new data product to our platform every year, which enables us to offer more sophisticated suggestions that keep our customers satisfied,” concluded Peniaziev. “Databricks and dbt are what keep our data engineers, analysts and product developers working efficiently so that we can keep exceeding expectations for anyone who uses the Explorium platform.”