Build Scalable Data Pipelines with Apache Spark™

Cost-effective ETL processing designed for performance and simplicity

This overview walks you through building data pipelines with Databricks across batch and streaming data. Learn how to ingest data, build data pipelines, run them in production, and automate these processes for reliability and scale.

The Challenge

BEFORE

  • Complex, redundant systems and operational challenges for processing batch and streaming data
  • Unreliable data processing jobs that require manual cleanup and reprocessing after failures
  • Long data processing times and increased infrastructure costs from inefficient data pipelines
  • Static infrastructure resources that incur expensive overhead costs and limit workload scalability
  • Unscalable processes, with tight dependencies, complex workflows, and system downtime

The Solution

AFTER

  • Unified and simplified architecture across batch and streaming to serve all use cases
  • Robust data pipelines that ensure data reliability with ACID transactions and data quality guarantees
  • Reduced compute times and costs with a scalable cloud runtime powered by highly optimized Spark clusters
  • Elastic cloud resources intelligently auto-scale up with workloads and scale down for cost savings
  • Modern data engineering best practices for improved productivity, system stability, and data reliability

Fast and easy data processing

Use the de facto standard for big data processing

Apache Spark™ is the go-to open source technology for large-scale data processing. Its speed, ease of use, and broad set of capabilities make it the Swiss Army knife of data and have led it to replace Hadoop and other technologies on data engineering teams. Spark is an open source project hosted by the Apache Software Foundation. Databricks was founded by the original creators of Apache Spark and has embedded and optimized Spark as part of a larger platform designed not only for data processing, but also for data science, machine learning, and business analytics.

Learn more

Leverage the data ecosystem around Apache Spark and Databricks

Apache Spark supports Scala, Java, SQL, Python, and R, as well as many different libraries for processing data. A wide variety of data sources can be connected through the data source APIs, including relational databases, streaming systems, NoSQL stores, file stores, and more. Databricks can also connect to a variety of AWS and Azure services, with a rich set of additional data ingest capabilities and partners for applications, mainframes, and more, depending on whether data is processed in place or copied.
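
As a rough illustration, the sketch below reads from two such sources with PySpark; the bucket path, database host, table, and credentials are hypothetical placeholders, and the JDBC read assumes the appropriate driver is installed on the cluster.

    from pyspark.sql import SparkSession

    # In a Databricks notebook, `spark` is already provided; this builder is for local experimentation.
    spark = SparkSession.builder.appName("ingest-example").getOrCreate()

    # Read JSON files landed in cloud object storage (path is a placeholder).
    events = spark.read.json("s3://my-bucket/raw/events/")

    # Read a table from a relational database through the JDBC data source (connection details are placeholders).
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/sales")
        .option("dbtable", "public.orders")
        .option("user", "reader")
        .option("password", "<secret>")  # in practice, pull this from a secret manager
        .load()
    )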

Learn more

Develop the data transformation logic for your pipeline

Once you’ve extracted data from your sources and landed it in the cost-effective cloud blob storage of your data lake, you can develop the transformation code that filters, cleans, and aggregates your raw data. Write data processing code in Scala, Java, SQL, Python, or R using integrated, cloud-based collaborative notebooks, or use Databricks Connect to attach your preferred IDE or notebook, such as IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio, Zeppelin, Jupyter, or other custom applications. You can also use visual drag-and-drop pipeline builders like Informatica or Talend to set up the transformation process.
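
For example, a minimal PySpark sketch of that filter-clean-aggregate step might look like the following, assuming `spark` is the SparkSession provided in a Databricks notebook and that the (hypothetical) paths and column names match your data:

    from pyspark.sql import functions as F

    # Load raw events previously landed in the data lake (path is a placeholder).
    raw = spark.read.json("s3://my-bucket/raw/events/")

    # Filter, clean, and aggregate the raw data.
    daily_revenue = (
        raw.filter(F.col("status") == "completed")           # drop incomplete records
           .dropDuplicates(["event_id"])                      # remove duplicate events
           .withColumn("event_date", F.to_date("event_ts"))   # normalize the timestamp to a date
           .groupBy("event_date", "country")
           .agg(F.sum("amount").alias("revenue"))
    )

    # Write the curated result back to the data lake.
    daily_revenue.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_revenue/")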

Learn more

Improve data reliability with Delta Lake and Spark across batch and streaming

Data reliability is a critical issue for data pipelines. Failed jobs can corrupt and duplicate data with partial writes, and multiple data pipelines reading and writing concurrently to your data lake can compromise data integrity. Delta Lake is an open source storage layer for your existing data lake that uses versioned Apache Parquet™ files and a transaction log to track all data commits and bring reliability capabilities to Spark. ACID transactions ensure that multiple data pipelines can simultaneously read and write data reliably on the same table. Schema Enforcement ensures data types are correct and required columns are present, and Schema Evolution allows these requirements to change as data changes.
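
A brief sketch of how these guarantees surface in code, assuming `df` and `new_df` are existing Spark DataFrames and the table path is a placeholder:

    # Each write to a Delta table is an ACID transaction; concurrent readers see a consistent snapshot.
    df.write.format("delta").mode("append").save("/delta/events")

    # Schema enforcement: an append whose columns or types do not match the table schema fails
    # instead of silently corrupting the table. Schema evolution is an explicit opt-in:
    (new_df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")   # allow new columns in the incoming data
        .save("/delta/events"))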

Learn more

Convert Parquet to Delta Lake with a simple command change

Getting started with open source Delta Lake is quick and easy: begin with a simple change to the command that writes data, switching it to the Delta format. Parquet files can be converted in place to the open Delta Lake format, and Delta tables can just as easily be converted back to Parquet. Once you’ve started with Delta Lake, you can unify batch and stream processing under a single, simplified architecture.
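
In practice, the change is the format string on the write, and the in-place conversion is a single SQL statement; a sketch with placeholder paths:

    # Before: writing Parquet
    df.write.format("parquet").save("/data/events")

    # After: the same write targeting Delta Lake
    df.write.format("delta").save("/data/events")

    # Convert an existing Parquet directory to Delta Lake in place.
    spark.sql("CONVERT TO DELTA parquet.`/data/events`")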

Learn more

Architect for continuous data flow and progressive refinement

The data reliability guarantees provided by Delta Lake across batch and streaming enable new data architecture patterns. A “medallion” model takes raw data landed from source systems and progressively refines it through bronze, silver, and gold tables. Streaming data pipelines automatically read and write the data through the different tables, with data reliability ensured by Delta Lake. The result is data that flows continuously through your data lake, providing end users with the most complete, reliable, and up-to-date data available.
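
One hop of such a pipeline might look like the following sketch, which uses Structured Streaming to refine a bronze Delta table into a silver one; the paths and column names are hypothetical:

    from pyspark.sql import functions as F

    # Continuously read new data appended to the bronze table.
    bronze = spark.readStream.format("delta").load("/delta/bronze/events")

    # Apply basic quality filters and normalization on the way to silver.
    silver = (
        bronze.filter(F.col("event_id").isNotNull())
              .withColumn("event_date", F.to_date("event_ts"))
    )

    # Continuously write the refined data to the silver table; progress is tracked
    # through the checkpoint location.
    (silver.writeStream
        .format("delta")
        .option("checkpointLocation", "/delta/checkpoints/silver_events")
        .outputMode("append")
        .start("/delta/silver/events"))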

Learn more

Optimized runtimes ready for processing workloads

Process data with the Databricks Runtime, which leverages a highly optimized version of Apache Spark tuned by the original creators of the project for up to 50x performance gains. These runtime images bundle specific versions of software, including Ubuntu, Scala, Java, Python, and R, and are continually maintained and updated to support new releases. Easily select the desired Databricks Runtime version for your data processing, or bring your own container image, which can then be run on cloud compute resources on Azure or AWS.

Learn more

Self-service data processing with managed Spark clusters

Abstract away infrastructure complexity and resource management. Databricks makes it easy to self-serve compute resources for data processing needs while maintaining administrative control over usage. Clusters intelligently auto-scale with workloads, auto-terminate when inactive, and can be sized across a variety of optimized memory and compute configurations. Cluster environments can be managed with performance monitoring and cluster logs. For your clusters, select from different prebuilt Databricks Runtime versions, or bring your own container image for custom environments.
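
For illustration, a hedged sketch of creating such a cluster through the Clusters REST API from Python; the workspace URL, access token, runtime version, and node type are placeholders, and the field names should be checked against the current API reference:

    import requests

    workspace = "https://<your-workspace>.cloud.databricks.com"  # placeholder
    token = "<personal-access-token>"                             # placeholder

    cluster_spec = {
        "cluster_name": "etl-autoscaling",
        "spark_version": "13.3.x-scala2.12",                # example Databricks Runtime version label
        "node_type_id": "i3.xlarge",                        # example cloud instance type
        "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with the workload
        "autotermination_minutes": 30,                      # shut down after 30 idle minutes
    }

    resp = requests.post(
        f"{workspace}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {token}"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print(resp.json())  # the response includes the new cluster_id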

Learn more

Schedule production jobs for batch and streaming

With your data processing logic ready and clusters set up, you can configure how your jobs should run, such as on a set schedule for batch jobs or continuously with Structured Streaming. You can configure dependencies and job frequency, as well as review any active or previously completed runs. Jobs can be scheduled against notebooks or custom JARs containing your data processing code. Manage job creation and execution through the main UI, CLI, or API, and set up alerts on job status through email or notification systems like PagerDuty or Slack.
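
As an illustration, the following hedged sketch schedules a notebook job through the Jobs REST API; the workspace URL, token, cluster ID, notebook path, and cron expression are placeholders, and the payload shape should be verified against the current Jobs API reference:

    import requests

    workspace = "https://<your-workspace>.cloud.databricks.com"  # placeholder
    token = "<personal-access-token>"                             # placeholder

    job_spec = {
        "name": "nightly-etl",
        "tasks": [
            {
                "task_key": "transform",
                "existing_cluster_id": "<cluster-id>",  # placeholder
                "notebook_task": {"notebook_path": "/Repos/etl/transform"},
            }
        ],
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",  # run nightly at 02:00
            "timezone_id": "UTC",
        },
        "email_notifications": {"on_failure": ["data-eng@example.com"]},
    }

    resp = requests.post(
        f"{workspace}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=job_spec,
    )
    resp.raise_for_status()
    print(resp.json())  # the response includes the new job_id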

Learn more

Continuous integration & continuous delivery on Databricks

Easily connect Databricks into your existing CI/CD processes across development, staging, and production environments. Sync notebooks with your version control system, such as GitHub, to see branches and history for local development. During your build and deployment processes, such as with Jenkins, you can push the release artifact of compiled code and configuration files to blob storage as a JAR file with the Databricks CLI or API, where it can then be read by a Databricks workspace. The Databricks API can then be used to update an existing data processing job to point to the new JAR file for the transformation code, or to create a new job scheduled for the release artifact.
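
One possible sketch of the deployment step, assuming the release JAR has already been pushed to storage (for example with the Databricks CLI): a Jobs API call repoints an existing job at the new artifact. The job ID, cluster ID, class name, and JAR path are placeholders, and the payload shape should be verified against the current Jobs API reference.

    import requests

    workspace = "https://<your-workspace>.cloud.databricks.com"  # placeholder
    token = "<personal-access-token>"                             # placeholder

    # Overwrite the job definition so its JAR task points at the newly released artifact.
    reset_spec = {
        "job_id": 123,  # placeholder
        "new_settings": {
            "name": "nightly-etl",
            "tasks": [
                {
                    "task_key": "transform",
                    "existing_cluster_id": "<cluster-id>",  # placeholder
                    "spark_jar_task": {"main_class_name": "com.example.etl.Main"},
                    "libraries": [{"jar": "dbfs:/releases/etl-1.2.3.jar"}],
                }
            ],
        },
    }

    resp = requests.post(
        f"{workspace}/api/2.1/jobs/reset",
        headers={"Authorization": f"Bearer {token}"},
        json=reset_spec,
    )
    resp.raise_for_status()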

Learn more

Bring software development best practices to data engineering

CI/CD fits within a larger framework of best practices that improve how data pipelines are built, tested, and managed in production with Databricks. Leverage unit tests and integration tests to validate that data processing works as intended and that the overall system fits together. Troubleshoot issues with the different logs available for Spark drivers and workers, cluster events, and system output. Monitor production cluster performance with Ganglia and Datadog metrics, and check operational information on the Databricks status page.
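
A minimal sketch of such a unit test, using pytest and a local SparkSession; the transformation function and column names are hypothetical:

    import pytest
    from pyspark.sql import SparkSession, functions as F


    def keep_completed(df):
        """Transformation under test: keep only completed records."""
        return df.filter(F.col("status") == "completed")


    @pytest.fixture(scope="session")
    def spark():
        # A small local SparkSession is enough for unit-testing transformation logic.
        return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


    def test_keep_completed_drops_other_statuses(spark):
        df = spark.createDataFrame(
            [("a", "completed"), ("b", "pending")], ["event_id", "status"]
        )
        result = keep_completed(df).collect()
        assert [row.event_id for row in result] == ["a"]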

Learn more

Customer Stories

How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity

Healthdirect uses Databricks to process terabytes of data, leveraging fine-grained table features and data versioning to solve duplication and eliminate data redundancy. This has enabled them to develop and provide high-quality data that improves health services demand forecasting and clinical outcomes in service lines like Aged Care and Preventative Health.

Ready to Get Started?