This blog is authored by Michael Ewins, Director of Engineering at Skyscanner.
At Skyscanner, we're more than just a flight search engine. We are a global leader in travel, helping more than 110 million users each month plan and book their trips with confidence and ease. Operating in over 30 languages, our platform connects travelers with a wide range of flights, hotels, and car rental options from over 1,200 travel partners across 180 countries.
We use data and AI to enhance the traveler experience as well as support internal decision-making. For our travelers, we use machine learning (ML) models to check over 80 billion prices every day, ranking and recommending hotels, flights, and car rentals, aiming to provide the best options based on journey time and costs. The Databricks Data Intelligence Platform powers some of these travel insights. In this blog, we discuss our journey with Databricks and how Unity Catalog helped us streamline our data management and governance.
To learn more, attend the Data + AI Summit 2024 for our session titled Skyscanner’s Journey of Enabling Practical Data and AI Governance.
Data has always been central to Skyscanner’s operations. Every day, our platform handles 35 million searches, generating 30 to 35 billion analytical events. The sheer volume of data, approximately 15 to 20 petabytes stored at any given time, poses significant challenges in data management and utilization. Our data is crucial for both consumer-facing features and internal decision-making processes, making its effective management a top priority for our engineering teams. This scale of data operations presents several challenges:
At Skyscanner, our commitment to leveraging cutting-edge technology is evident in our strategic partnership with Databricks. Databricks has been instrumental in transforming our approach to data management, enabling us to streamline operations and enhance the traveler experience.
All our data pipelines are built on top of the Databricks Data Intelligence Platform. We've established a robust data ingestion framework that captures data from a variety of sources, incorporating both batch and real-time streams. We utilize AWS Kinesis for streaming and Fivetran for batch data ingestion, ensuring that all incoming data is collected efficiently into our initial staging area, which we refer to as the 'bronze layer' of our medallion architecture. This stage is crucial as it handles the raw data collected from our diverse channels, including direct interactions from our web and mobile platforms.
Once in the bronze layer, the data undergoes a series of transformations and enrichments to prepare it for deeper analytical tasks. It then moves to the 'silver layer,' where it is cleaned, consolidated, and structured, ready for analytical consumption. In this phase, Databricks’ powerful Spark engine plays a crucial role, enabling fast and scalable data transformations.
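To make the bronze-to-silver step concrete, here is a minimal sketch in plain Python rather than Spark. The field names, the sample events, and the cleaning rules (deduplicate on `event_id`, drop rows with a missing user or unparseable values) are all assumptions for illustration, not our production schema or logic.

```python
from datetime import datetime, timezone

# Hypothetical raw events as they might land in a bronze table.
bronze_events = [
    {"event_id": "e1", "user_id": "u1", "ts": "2024-05-01T10:00:00Z", "price": "120.50"},
    {"event_id": "e1", "user_id": "u1", "ts": "2024-05-01T10:00:00Z", "price": "120.50"},  # duplicate
    {"event_id": "e2", "user_id": None, "ts": "2024-05-01T10:05:00Z", "price": "99.00"},   # missing user
    {"event_id": "e3", "user_id": "u2", "ts": "not-a-timestamp", "price": "80.00"},        # bad timestamp
    {"event_id": "e4", "user_id": "u3", "ts": "2024-05-02T08:30:00Z", "price": "310"},
]


def to_silver(events):
    """Deduplicate on event_id, drop malformed rows, and cast fields to proper types."""
    seen = set()
    silver = []
    for event in events:
        # Skip duplicates and rows without a user.
        if event["event_id"] in seen or not event.get("user_id"):
            continue
        try:
            ts = datetime.strptime(event["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
            price = float(event["price"])
        except (ValueError, TypeError):
            continue  # skip rows whose timestamp or price cannot be parsed
        seen.add(event["event_id"])
        silver.append(
            {"event_id": event["event_id"], "user_id": event["user_id"], "ts": ts, "price": price}
        )
    return silver


silver_events = to_silver(bronze_events)
for row in silver_events:
    print(row["event_id"], row["ts"].isoformat(), row["price"])
```

At our scale the same rules would be expressed as distributed Spark transformations, but the shape of the step is the same: raw records in, typed and deduplicated records out.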
In the 'gold layer,' our data is optimized for consumption by various business units: it is modeled and aggregated into metrics that directly support decision-making across the company. We leverage MLflow to manage the complete machine learning lifecycle. This includes everything from experimentation and reproducibility to the deployment of ML models, allowing us to track experiments, package code into reproducible runs, and deploy models directly into production seamlessly. While we’re currently serving these models into production using our own model-serving architecture, we’re in the process of evaluating Databricks’ model-serving capabilities that are part of the Databricks Mosaic AI offering.
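The gold-layer aggregation can be sketched in the same spirit. This plain-Python example rolls cleaned search events up into one metric row per day and market. The `market` and `price` fields and the chosen metrics are hypothetical, picked only to show the modeling step.

```python
from collections import defaultdict
from datetime import date

# Hypothetical cleaned (silver) search events.
silver_searches = [
    {"day": date(2024, 5, 1), "market": "UK", "price": 120.0},
    {"day": date(2024, 5, 1), "market": "UK", "price": 80.0},
    {"day": date(2024, 5, 1), "market": "ES", "price": 95.0},
    {"day": date(2024, 5, 2), "market": "UK", "price": 310.0},
]


def to_gold(rows):
    """Aggregate rows into one metric record per (day, market)."""
    totals = defaultdict(lambda: {"searches": 0, "price_sum": 0.0})
    for row in rows:
        bucket = totals[(row["day"], row["market"])]
        bucket["searches"] += 1
        bucket["price_sum"] += row["price"]
    # Emit records in a stable (day, market) order.
    return [
        {
            "day": day,
            "market": market,
            "searches": agg["searches"],
            "avg_price": agg["price_sum"] / agg["searches"],
        }
        for (day, market), agg in sorted(totals.items())
    ]


gold_metrics = to_gold(silver_searches)
for metric in gold_metrics:
    print(metric)
```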
Beyond processing and machine learning, we utilize Databricks for operational reporting and analytics. Databricks SQL allows our teams to perform SQL queries directly against our data lake, create dashboards, and execute complex analytical operations at scale. Integration with BI tools like Tableau Cloud enhances our capabilities, enabling us to visualize data and extract actionable insights efficiently.
Data governance is a critical component of Skyscanner's architecture. It underpins our ability to manage data securely and efficiently, ensuring that we can trust our data for making business decisions and maintaining compliance with global data protection regulations, including GDPR. As a subsidiary of a company listed on NASDAQ, adhering to strict regulatory standards such as the Sarbanes-Oxley Act is paramount for ensuring transparency and accountability in our operations. Databricks Unity Catalog, being built into the platform, helped us streamline these requirements.
Before implementing Unity Catalog, we faced several significant challenges.
Recognizing these challenges, we developed a strategic approach to migrate to Unity Catalog. Our strategy included:
Unity Catalog has become a pivotal element in our data governance framework at Skyscanner. It now manages and governs approximately 15 to 20 petabytes of our data. This includes everything from raw data in our 'bronze' layer to processed data in our 'silver' and 'gold' layers, which are used extensively across various business functions for analytical and operational purposes.
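As an illustration of what governing the layers can look like, the sketch below generates Unity Catalog `GRANT` statements that give progressively wider read access from bronze to gold. It assumes each layer is a schema inside a single catalog. The catalog name `travel`, the group names, and the access policy itself are hypothetical and do not describe our actual setup.

```python
# Hypothetical mapping of medallion layers (modeled here as schemas) to reader groups.
LAYER_READERS = {
    "bronze": ["data-engineers"],
    "silver": ["data-engineers", "analysts"],
    "gold": ["data-engineers", "analysts", "business-users"],
}


def grant_statements(catalog, layer_readers):
    """Build Unity Catalog GRANT statements giving each group read access to a layer's schema."""
    statements = []
    # Every group that reads any layer first needs USE CATALOG on the parent catalog.
    groups = sorted({group for readers in layer_readers.values() for group in readers})
    for group in groups:
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`;")
    # SELECT granted on a schema is inherited by the tables inside it.
    for layer, readers in layer_readers.items():
        for group in readers:
            statements.append(f"GRANT USE SCHEMA ON SCHEMA {catalog}.{layer} TO `{group}`;")
            statements.append(f"GRANT SELECT ON SCHEMA {catalog}.{layer} TO `{group}`;")
    return statements


statements = grant_statements("travel", LAYER_READERS)
for statement in statements:
    print(statement)
```

Keeping the policy as data and generating the statements from it makes the access model easy to review and audit, which is one way to support the kind of compliance requirements described earlier.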
The implementation of Unity Catalog has brought substantial improvements to our data management and governance capabilities, yielding several key benefits:
As we look ahead, I think the value in generative AI will come from the unique, valuable data we have at Skyscanner. There’s a lot of potential, but a key step for us is making sure we have everything, including ML models, managed and governed with Unity Catalog to capitalize on any opportunities.
Currently, we’re evaluating Databricks’ Model Serving capability. We’re also looking at enabling Unity Catalog in multiple regions, using Delta Sharing to move data between regions. We’re thinking about using Delta Sharing for external data sharing as well, since we have some data products where we share data with third-party companies.
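For readers unfamiliar with Delta Sharing, the sketch below generates the Databricks SQL statements typically involved in sharing tables with a recipient: create a share, add tables to it, create a recipient, and grant the recipient access. The share, table, and recipient names are hypothetical, and since we are still evaluating this, the example does not reflect a deployed configuration.

```python
def share_statements(share, tables, recipient):
    """Build the SQL statements that expose a set of tables to one Delta Sharing recipient."""
    statements = [f"CREATE SHARE IF NOT EXISTS {share};"]
    # Each table is added to the share by its full catalog.schema.table name.
    statements += [f"ALTER SHARE {share} ADD TABLE {table};" for table in tables]
    statements.append(f"CREATE RECIPIENT IF NOT EXISTS {recipient};")
    statements.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient};")
    return statements


# Hypothetical share of two gold tables with a partner.
partner_share = share_statements(
    share="partner_insights",
    tables=["travel.gold.route_demand", "travel.gold.price_trends"],
    recipient="partner_co",
)
for statement in partner_share:
    print(statement)
```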
In the future, we want our data teams to focus on problems unique to Skyscanner. Databricks does a lot of the heavy lifting when it comes to model serving and provides a good framework for thinking about the AI journey—from prompt engineering to building your own model. We have confidence in our ability to realize the opportunities we’re identifying using the Databricks ecosystem.
Learn more about Skyscanner’s journey at the Data + AI Summit 2024 by joining Michael’s session, Skyscanner’s Journey of Enabling Practical Data and AI Governance.