This is a collaborative post from Databricks and YipitData. We thank Engineering Manager Hillevi Crognale at YipitData for her contributions.
YipitData is the trusted source of insights from alternative data for the world’s leading investment funds and companies. We analyze billions of data points daily to provide accurate, detailed insights on many industries, including retail, e-commerce marketplaces, ridesharing, payments, and more. Our team uses Databricks and Databricks Workflows to clean and analyze petabytes of data that many of the world’s largest investment funds and corporations depend on.
Out of 500 employees at YipitData, over 300 have a Databricks account, with the largest segment being data analysts. The Databricks platform's success and penetration at our company are largely a result of a strong culture of ownership. We believe that analysts should own and manage all of their ETL end-to-end, with a central Data Engineering team supporting them through guardrails, tooling, and platform administration.
Adopting Databricks Workflows
Historically, we have relied on a customized Apache Airflow installation on top of Databricks for data orchestration. Data orchestration is essential to how our business operates, as our products are derived from joining hundreds of different data sources in our petabyte-scale Lakehouse on a daily cadence. These data flows were expressed as Airflow DAGs using the Databricks operator.
Data analysts at YipitData set up and managed their DAGs through a bespoke framework developed by our Data Engineering platform team, and expressed transformations, dependencies, and cluster t-shirt sizes in individual notebooks.
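For illustration, here is a simplified sketch of what one of those DAGs might have looked like. The notebook paths, cluster spec, and schedule are hypothetical placeholders rather than our actual framework code:

```python
# Illustrative sketch only -- a simplified version of the kind of Airflow DAG
# described above. Notebook paths, cluster sizes, and the schedule are
# hypothetical placeholders, not YipitData's actual configuration.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# A cluster "t-shirt size" mapped to a concrete cluster spec (hypothetical values).
MEDIUM_CLUSTER = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
}

with DAG(
    dag_id="example_daily_transform",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task runs a Databricks notebook that holds the actual transformation.
    clean = DatabricksSubmitRunOperator(
        task_id="clean_raw_data",
        new_cluster=MEDIUM_CLUSTER,
        notebook_task={"notebook_path": "/Transforms/clean_raw_data"},
    )
    aggregate = DatabricksSubmitRunOperator(
        task_id="aggregate_metrics",
        new_cluster=MEDIUM_CLUSTER,
        notebook_task={"notebook_path": "/Transforms/aggregate_metrics"},
    )

    clean >> aggregate  # the aggregation notebook waits for the cleaning notebook to finish
```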
We decided to migrate to Databricks Workflows earlier this year. Workflows is a managed service on the Databricks Lakehouse that lets our users build and manage reliable data analytics workflows in the cloud, giving us the scale and processing power we need to clean and transform the massive amounts of data we sit on. Moreover, its ease of use and flexibility mean our analysts can spend less time setting up and managing orchestration and instead focus on what really matters: using the data to answer our clients' key questions.
With over 600 DAGs active in Airflow before this migration, we were executing up to 8,000 data transformation tasks daily. Our analysts love the productivity boost that comes from orchestrating their own work, and that ownership has paid off for the company.
While Airflow is a powerful tool and has served us well, it had several drawbacks for our use case, which ultimately led us to look for an alternative.
"If we went back to 2018 and Databricks Workflows was available, we would never have considered building out a custom Airflow setup. We would just use Workflows."
Once Databricks Workflows was introduced, it was clear to us that this would be the future. Our goal is to have our users do all of their ETL work on Databricks, end-to-end. The more we work with the Databricks Lakehouse Platform, the easier things get, both from a user experience perspective and from a data management and governance perspective.
How we made the transition
Overall, the migration to Workflows has been relatively smooth. Since we already used Databricks notebooks as the tasks in each Airflow DAG, it was largely a matter of creating a workflow from the settings, dependencies, and cluster configuration already defined in each Airflow DAG. Using the Databricks APIs, we created a script to automate most of the migration process, along the lines of the sketch below.
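The following is a minimal sketch of that kind of migration script, assuming the Databricks Jobs API 2.1 is used to create a multi-task workflow from settings pulled out of an Airflow DAG. The workspace URL, token handling, and the `dag_config` structure are hypothetical placeholders; the real script reads these values from the existing Airflow definitions.

```python
# Hypothetical sketch of a migration script: translate settings extracted from
# an Airflow DAG into a Databricks Workflows job via the Jobs API 2.1.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

# Settings extracted from one Airflow DAG: notebook tasks, dependencies,
# and the cluster spec the tasks used (hypothetical example values).
dag_config = {
    "name": "example_daily_transform",
    "tasks": [
        {"key": "clean_raw_data", "notebook": "/Transforms/clean_raw_data", "depends_on": []},
        {"key": "aggregate_metrics", "notebook": "/Transforms/aggregate_metrics", "depends_on": ["clean_raw_data"]},
    ],
    "cluster": {"spark_version": "11.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 8},
}

# Translate the Airflow settings into a multi-task Workflows job payload.
job_payload = {
    "name": dag_config["name"],
    "job_clusters": [{"job_cluster_key": "shared_cluster", "new_cluster": dag_config["cluster"]}],
    "tasks": [
        {
            "task_key": task["key"],
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": task["notebook"]},
            "depends_on": [{"task_key": dep} for dep in task["depends_on"]],
        }
        for task in dag_config["tasks"]
    ],
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

# Create the job in the target workspace.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_payload,
)
response.raise_for_status()
print("Created job:", response.json()["job_id"])
```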
"To us, Databricks is becoming the one-stop shop for all of our ETL work. The more we work with the Lakehouse Platform, the easier it is for both users and platform administrators."
Workflows offers several features that greatly benefit us.
The Databricks platform lets us manage and process our data at the speed and scale we need to be a leading market research firm in a disruptive economy. Adopting Workflows as our orchestration tool was a natural step given how integrated we already are with the platform, and the success we've experienced as a result. When we can empower our users to own their work and get their jobs done more efficiently, everybody wins.
To learn more about Databricks Workflows, check out the Databricks Workflows page, watch the Workflows demo, and enjoy an end-to-end demo with Databricks Workflows orchestrating streaming data and ML pipelines on the Databricks Demo Hub.