Boosting blockchain and finance data with streaming ingestion
Coinbase turned to Databricks to help them process data with low latency
reduction in processing time to backfill the entire Ethereum history
of thousands of CDC traffic updates processed per second
of thousands of ingress traffic events processed per second
Coinbase is a 24/7 global fintech organization. The company has millions of users who generate a staggering number of mobile and web application transactions. Not only does Coinbase generate blockchain data, but the company also combines their entire ecosystem data into feature, index and analytic datasets to power Coinbase products and offer better insights to their customers.
Data latency issues block real-time incident analysis and metrics
With blockchain rapidly paving the road to Web 3.0, Coinbase was facing data management issues, creating a less-than-optimal experience for both external users and internal teams.
With their Airflow-based, Kafka-to-Snowflake ETL pipelines, Coinbase struggled with efficiently processing the high volumes of data required to serve their rapidly growing user base. With some Kafka topics delivering up to hundreds of thousands of events per second, the platform was unable to keep up. High latency hindered processes such as ad hoc incident analysis for real-time customer support, real-time metrics for dashboards, production monitoring and anomaly detection.
A second challenge was the scattered nature of Coinbase’s data. “Think about smart contracts and autonomous data generation and all these cryptocurrency protocols — thousands of them — generating data in a decentralized way,” explained Eric Sun, Senior Engineering Leader at Coinbase. Because the company’s data was siloed, Coinbase lacked a centralized location for joins and aggregations, making it difficult to derive fast, high-quality and meaningful insights. The fragmented data required frequent resyncs, adding to the inefficiency. Additionally, Coinbase was unable to quickly replicate tables of any size from different sources, making accessibility for team members writing SQL a major challenge.
Finally, the introduction of various databases at Coinbase led to high operational overhead for data ingestion teams. Maintaining multiple codebases in different programming languages resulted in a long learning curve for new members and inconsistencies in user experience, documentation and onboarding processes. Furthermore, different architectures for various ingestion and replication pipelines (PostgresSQL, MongoDB, Dynamo, MySQL, Kafka) complicated user onboarding and increased maintenance and operational overhead.
A blockchain-ready streaming ingestion framework
Turning to Databricks proved pivotal for Coinbase, as it enabled the creation of SOON (Spark cOntinuOus iNgestion), the company’s streaming ingestion framework. This framework allows the company to process with low latency and backfill quickly. The custom Apache Spark™ Streaming solution is built on the Databricks Data Intelligence Platform. This comprehensive solution is a unified, configuration-driven streaming ingestion framework from Kafka to Delta Lake.
SOON ingests the CDC and non-CDC events into Databricks Delta Lake in near real-time using Spark Structured Streaming APIs. This unified ingestion framework has simplified the user onboarding experience and streamlined maintenance for the team. With Spark Structured Streaming at the core of SOON, “we can access data much faster, powering our hourly processes efficiently,” explained Sun. “The addition of the table replication service ensures business continuity and agility, while the unified framework supports both append-only and merge-updates for various scenarios, including merge CDC and non-CDC events.”
The Databricks Data Intelligence Platform has added significant technical value to this setup. It provides advanced analytics capabilities, seamless integration with a variety of data sources and powerful machine learning support. This platform has enhanced Coinbase's ability to derive actionable insights from their data, ensuring high-quality, real-time data processing and decision-making.
To address the scattered nature of data and the need for centralized processing, the Data Intelligence Platform supports various data formats, enabling the ingestion of non-CDC events in JSON, protobuf, and binary formats with custom UDFs, while CDC events follow a standard SOON CDC schema in JSON format. SOON supports both internal and external users, streamlining processes and supporting scalability in a growing data ecosystem. “Databricks offers the open standard to quickly integrate all the Python and Java libraries we need to decode and interact with the blockchain,” said Sun. “Its streaming technology allows us to process data with low latency, and its powerful batch processing capabilities enable rapid backfilling.”
Driving blockchain technology with near real-time streaming
To address the complexity and inefficiency of multiple data ingestion processes, the introduction of SOON built on the Data Intelligence Platform has unified all streaming ingestion scenarios at Coinbase. SOON simplifies the entire system by providing a single onboarding experience and focusing maintenance on one framework. Consequently, operational overhead is significantly reduced, user onboarding is streamlined and consistency is ensured across the board. This simplification has improved performance and reduced latency, with configurable settings to balance cost and data freshness. For the team’s biggest append-only table, latency is configured to 15 minutes, and for the largest CDC-merge table, it is 30 minutes, with smaller tables refreshing every minute (large CDC-merge tables can easily have their latency configured to less than 30 minutes if justified by business requirements). SOON handles high ingress traffic efficiently, managing large CDC-merge tables and high inbound CDC update traffic. “We can now get the data quickly and easily, and this enables us to capture a lot of vital business metrics with very low latency so we can respond to production monitoring upsell opportunities much more quickly,” said Sun.
With SOON, Coinbase can power their blockchain as a service in near real-time and generate a variety of graphs, time series and key-value pairs with very low latency. Additionally, Delta Sharing allows Coinbase to share augmented, enriched data with the whole industry and monetize it. This seamless integration not only enhances data-driven decision-making but also establishes Coinbase as a pivotal data provider in the blockchain ecosystem.
The impact of building SOON on top of the Data Intelligence Platform is evident not only in Coinbase’s technical optimizations but also in enhanced overall business agility and competitiveness in an exponentially data-driven landscape. Databricks supports Coinbase in optimizing their business intelligence, building their customer base and continuing to grow without data bottleneck issues.
Looking ahead, Sun predicts that, with Databricks as a foundation, the possibilities are endless for Coinbase. “In two to three years, all our ML workloads will run on the Databricks Data Intelligence Platform — from feature generation to core metrics — all in near real-time, and all handled by Delta Lake and Spark Structured Streaming.”