Empowering business owners to make smart financial decisions
KCD optimized data processing to reduce costs by 90% and boost user satisfaction
Reduction in management costs through pipeline optimization strategies
Reduction in additional cluster maintenance costs with Unity Catalog
Reduction in cluster execution costs through optimization
As a business owner, you might often wonder, “Why isn’t my business making enough money?” or “How many regular customers do I have?” Korea Credit Data (KCD) helps answer these questions by connecting business data to empower owners to make informed decisions. KCD’s Cashnote app — used by 1.4 million business owners — provides insights into credit card sales, credit score management and financial benefits. Wanting to expand their services to include credit scoring, payment, marketing and POS management, KCD adopted the Databricks Data Intelligence Platform to add optimization strategies to their data management workflows, streamlining operations and reducing time and overall costs.
Difficulties in tracking costs, data lineage and more
Korea Credit Data — a fintech company that’s always been concerned with helping small-to-medium enterprises (SMEs) optimize financial data management, enhance decision-making and support their sustainable business growth — developed the data-driven Cashnote app to achieve all of these initiatives for their valued customers. Better yet, the app allowed business owners to overcome a variety of common obstacles, including approvals, purchases, deposits and sales. Cashnote users are given detailed views of credit card sales that are broken down by eight different credit card companies, as well as cash sales and tax invoices. Along with providing these quick glimpses of important information, the app can also recommend financial instruments, manage credit scores of both businesses and their owners, and provide expedient updates on government subsidies to help users access financial benefits. Finally, Cashnote Market enables business owners to buy food ingredients in bulk, and Cashnote Community gives them a space to share know-how and concerns in productive ways.
With so many services offered to their users, KCD needed to collect and cleanse large amounts of business data and efficiently apply these insights to continually develop their offerings. However, on their existing AWS platform, KCD had to manually load the collected and cleansed data into a Relational Database Service (RDS) database and then upload the transaction data to a different RDS database. “Queries that joined around 30 billion data sources to create this aggregated data were highly complex,” explained Sangyoung Park, Data Engineer at KCD. “Not to mention, these transaction aggregation queries had to be executed continuously with each new data ingestion, leading to significant financial and time costs.”
The existing platform also had challenges with tracking costs, learning frameworks and tracing data lineage. Because a large number of batch jobs ran on just two Elastic MapReduce (EMR) clusters, tracking costs per cluster was relatively easy, but it was difficult to see how many resources each batch job actually used. The existing framework was also hard for KCD’s new hires to learn. Because the data pipeline development process used a proprietary framework rather than a language commonly used in data engineering, new hires had to learn and understand a whole new framework before they could even build a data pipeline and respond to errors. Finally, because data lineage was difficult to trace, it was impossible to communicate with relevant departments when errors or changes occurred in specific data points. Consequently, KCD adopted the Databricks Data Intelligence Platform to solve these multifaceted challenges.
Managing customer data transaction pipelines efficiently
KCD adopted the Databricks Platform to enhance the efficiency of managing customer data, which was crucial for Cashnote’s service delivery. With Databricks, KCD built, processed and monitored customer data transaction pipelines, enabling Cashnote to offer financial services that provided customers with easy access to aggregated sales revenue, credit card sales and delivery app data through a user-friendly dashboard. First, KCD empowered their staff to fully utilize the Databricks Platform by migrating large volumes of customer data to Delta Lake. The fast and efficient analytics in Delta tables significantly reduced data latency, allowing KCD to deliver quicker and more accurate financial services to their customers.
Next, KCD leveraged Unity Catalog to tag all batch jobs and clusters with identifiers, such as owner, team and category, which enabled clear visibility of resource consumption by project across the organization. Daily and weekly notifications helped limit maintenance costs for unlisted clusters and ensured that missing tags were promptly addressed. This approach allowed KCD to quickly identify resource-intensive projects to reduce operational costs by optimizing batch jobs that exceeded certain resource-related thresholds.
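A tag audit of the kind described above could look something like the following minimal sketch. It assumes cluster records shaped like the entries returned by the Databricks Clusters API (a `cluster_id`, a `cluster_name` and a `custom_tags` mapping). The tag keys come from the text; the function name, the sample records and the choice to treat all three tags as mandatory are hypothetical and do not represent KCD's actual implementation.

```python
from typing import Dict, List

# Tag keys named in the text; treating all three as mandatory is an assumption.
REQUIRED_TAGS = ("owner", "team", "category")


def find_missing_tags(clusters: List[dict]) -> Dict[str, List[str]]:
    """Map each cluster name to the required tag keys it lacks or leaves blank."""
    report: Dict[str, List[str]] = {}
    for cluster in clusters:
        tags = cluster.get("custom_tags") or {}
        missing = [key for key in REQUIRED_TAGS
                   if not str(tags.get(key) or "").strip()]
        if missing:
            name = cluster.get("cluster_name") or cluster.get("cluster_id", "<unknown>")
            report[name] = missing
    return report


# Hypothetical sample records for illustration only.
sample_clusters = [
    {"cluster_id": "c-1", "cluster_name": "transactions-batch",
     "custom_tags": {"owner": "data-eng", "team": "pipeline", "category": "batch"}},
    {"cluster_id": "c-2", "cluster_name": "adhoc-analysis",
     "custom_tags": {"owner": "analyst"}},
    {"cluster_id": "c-3", "cluster_name": "untagged"},
]

for name, missing in find_missing_tags(sample_clusters).items():
    print(f"{name}: missing tags {', '.join(missing)}")
```

A report like this could feed the daily and weekly notifications mentioned above, with one message per cluster that still needs tags.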
Databricks also simplified the onboarding process and eliminated the need for new hires to learn existing frameworks up front, allowing them to acclimate and contribute to projects more quickly. With support for major data engineering languages like SQL, Scala and Python, Databricks improved not only productivity but also data accessibility and usability. Finally, Unity Catalog’s automatic generation of data lineage solved the tracing challenge, enabling KCD to quickly assess the impact of changes or errors. This allowed them to notify affected users when necessary, greatly improving operational and communication efficiency.
Furthermore, KCD optimized their transaction pipeline using Databricks’ built-in features. For instance, Auto Loader loaded data collected from the data ingestion layer into a Delta table, using “File Notification” mode to send file creation events to Amazon SQS and update read targets. Because the transaction pipeline runs every 20 minutes, data ingestion no longer needed to run constantly. The team also kept pipeline costs under control by adjusting the specifications of the job cluster. “By leveraging various features inherent in the Databricks Platform, like REST APIs, usage alerts and the specifications of clusters or instance pools, we optimized pipelines with a high degree of freedom, and at an efficient cost,” said Park.
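For readers unfamiliar with Auto Loader, the ingestion step described above might be sketched roughly as follows. This is an illustration under stated assumptions, not KCD's pipeline: the JSON file format, the paths, the table name and both function names are hypothetical, and the `availableNow` trigger is one assumed way to process newly arrived files in a job scheduled every 20 minutes. The `spark` argument is an existing SparkSession on a Databricks cluster.

```python
from typing import Dict


def auto_loader_options(schema_location: str, file_format: str = "json") -> Dict[str, str]:
    """Build Auto Loader reader options for file notification mode."""
    return {
        # The source file format is an assumption; the text does not state it.
        "cloudFiles.format": file_format,
        # Subscribe to file-creation events instead of listing the directory.
        "cloudFiles.useNotifications": "true",
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def run_ingestion(spark, source_path: str, target_table: str, checkpoint: str):
    """Load files that arrived since the last run into a Delta table, then stop."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(schema_location=checkpoint))
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)  # drain the backlog once per scheduled run
        .toTable(target_table)
    )
```

Because the checkpoint records which files were already processed, each scheduled run picks up only new arrivals, so no cluster has to stay up between runs.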
Experiencing massive cost savings compared to the previous system
With Databricks, KCD has significantly reduced pipeline management costs. Previously, they executed 700,000 transaction pipeline-related queries per day in RDS, leading to high I/O costs that amounted to tens of millions of won per month. After adopting Databricks, KCD began using Auto Loader to upload source data into Delta tables and update aggregated data, which drastically reduced I/O operations by syncing completed tables to RDS only when necessary.
Additionally, KCD could optimize costs by performing batch jobs on data collected in 20-minute increments and adjusting job cluster specifications via REST APIs. Previously, queries were executed every time data ingestion occurred for each site. The new approach saved 50% in costs compared to running clusters at full capacity for 24 hours. By tracking costs using tags within the Databricks Platform, KCD achieved further savings. Intensive cost monitoring from November 2023 to January 2024 led to a 70% reduction in expenses compared to the previous period. Overall, Databricks enabled the fintech company to reduce costs by nearly 90% compared to running the data pipeline in RDS.
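As a rough illustration of adjusting job cluster specifications via REST APIs, the sketch below prepares a request to the Databricks Jobs API 2.1 `jobs/update` endpoint without sending it. The host, token, job ID, cluster key, runtime version, pool ID and worker count are all hypothetical placeholders, and the function name is invented for this example. It sends the complete cluster spec rather than a partial one, on the assumption that the `job_clusters` list is replaced as a whole.

```python
import json
import urllib.request


def build_job_cluster_update(host: str, token: str, job_id: int, cluster_key: str,
                             spark_version: str, instance_pool_id: str,
                             num_workers: int) -> urllib.request.Request:
    """Prepare (but do not send) a Jobs API request that resizes a job cluster."""
    payload = {
        "job_id": job_id,
        "new_settings": {
            "job_clusters": [
                {
                    "job_cluster_key": cluster_key,
                    "new_cluster": {
                        "spark_version": spark_version,
                        # Drawing nodes from an instance pool shortens start-up.
                        "instance_pool_id": instance_pool_id,
                        "num_workers": num_workers,
                    },
                }
            ]
        },
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/update",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_job_cluster_update(
    host="https://example.cloud.databricks.com",
    token="<personal-access-token>",
    job_id=123,
    cluster_key="transactions",
    spark_version="14.3.x-scala2.12",
    instance_pool_id="<instance-pool-id>",
    num_workers=4,
)
print(request.full_url)
print(json.loads(request.data)["new_settings"]["job_clusters"][0]["new_cluster"])
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Keeping the request builder separate from the network call makes it easy to review or log the exact cluster specification before it is applied.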
Last but not least, the Databricks Data Intelligence Platform empowered KCD to optimize their most valuable resource: time. Streamlining the data ingestion process with Auto Loader’s File Notification mode stabilized network metrics and eliminated the 15-minute data latency issue that previously plagued the brand. Most notably, Databricks’ instance pools further reduced the overhead required to start a traditional job cluster from five minutes to just two minutes.
KCD overcame challenges related to tracking costs, learning frameworks and tracing data lineage using the Databricks Platform. “We are very much looking forward to the upcoming support for LLMs and serverless computing, which our teams will use to run resources directly into the Databricks Data Intelligence Platform,” relayed Park. He and his team couldn’t emphasize enough the improved efficiency they’ve experienced, from building and running pipelines with a high degree of freedom to how this success has set the bar high for KCD’s continued collaboration with Databricks.