Customer Case Study: Zip Co - Databricks




Zip Co is a leading payment and credit solution provider based out of Australia. They are on a mission to simplify how customers everywhere pay with fast, fair and seamless payment solutions online and in-store.


Financial Services

Vertical Use Case

Payment Service – Provides credit and risk analysis of applicants, using credit profiles such as bank statements and other data to predict who is creditworthy.

Technical Use Case

  • Data Ingest and ETL
  • Machine Learning

The Challenges

Zip Co’s data science and engineering teams model large volumes of financial data (e.g. bank statements and transactions) to predict credit risk. Complex architectures and data prep workflows delayed time-to-insight and drove up costs. Challenges included:

  • Managing infrastructure and clusters required significant DevOps effort from the engineering team.
  • Building ETL pipelines on EMR was highly complex and very costly due to the lack of features such as auto-scaling and auto-shutdown.
  • EMR clusters frequently terminated in error when trying to auto-scale to support exploratory workloads.
  • Data preparation to build and train machine learning models was time consuming and resource intensive.

The Solution

Databricks provides Zip Co with a unified analytics platform that simplifies operations and accelerates data preparation, speeding up data science-driven innovation.

  • Fully migrated both interactive and automated workloads from EMR to Databricks
  • Automated cluster management simplifies provisioning of compute resources, allowing ETL work to start promptly and giving data scientists easy control over their own workloads.
  • Auto-scaling, Spot Instances, and Auto-shutdown features reduced compute costs for ETL workloads
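As a rough sketch, the cost-saving features above map to fields in a Databricks cluster specification. The payload below uses field names from the Databricks Clusters API; the specific values (cluster name, runtime version, instance type, worker counts, idle timeout) are illustrative assumptions, not Zip Co's actual configuration:

```python
import json

# Hypothetical Databricks cluster spec enabling the three features
# described above: autoscaling, spot instances, and auto-termination.
# All concrete values are illustrative placeholders.
cluster_spec = {
    "cluster_name": "etl-pipeline",          # illustrative name
    "spark_version": "7.3.x-scala2.12",      # example runtime version
    "node_type_id": "i3.xlarge",             # example AWS instance type
    "autoscale": {
        "min_workers": 2,                    # scale down when load drops
        "max_workers": 8,                    # scale up under load
    },
    "autotermination_minutes": 30,           # shut down after 30 idle minutes
    "aws_attributes": {
        # Prefer spot instances, falling back to on-demand when
        # spot capacity is unavailable
        "availability": "SPOT_WITH_FALLBACK",
    },
}

# This payload would be sent to the cluster-create endpoint of the
# Databricks REST API; here we just render it for inspection.
print(json.dumps(cluster_spec, indent=2))
```

Autoscaling bounds keep interactive exploration cheap when idle, while the auto-termination timeout ensures forgotten clusters do not keep accruing charges.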

The Results

  • Reduced total Spark cluster costs by 65% in four months, including both AWS and Databricks charges.
  • Databricks clusters spin up in less than 3 minutes, compared to a minimum of 12 minutes and sometimes up to 20 minutes on EMR.
  • The combination of autoscaling, spot pricing, and auto-termination timeout settings maximizes compute utilization while minimizing spend on idle resources.

“The combination of autoscaling, spot-pricing and auto-termination has allowed us to maximize utilization of compute while minimizing money spent on idle resources.”

Yiting Shan, Big Data Engineer at Zip Co.