Cost-effective ETL processing designed for performance and simplicity
This overview walks users through building data pipelines with Databricks, across batch and streaming data. Learn how to ingest data, build data pipelines, run them in production, and automate these processes for reliability and scale.
Apache Spark™ is the go-to open source technology for large-scale data processing. Its speed, ease of use, and broad set of capabilities make it the Swiss Army knife of data, and have led it to replace Hadoop and other technologies on data engineering teams. Spark is an open source project hosted by the Apache Software Foundation. Databricks was founded by the original creators of Apache Spark, and has embedded and optimized Spark as part of a larger platform designed not only for data processing, but also for data science, machine learning, and business analytics.
Apache Spark supports Scala, Java, SQL, Python, and R, as well as many different libraries to process data. A wide variety of data sources can be connected through data source APIs, including relational, streaming, NoSQL, file stores, and more. Databricks can also connect to a variety of AWS and Azure services, with a rich set of additional data ingest capabilities and partners for applications, mainframes, and more, depending on whether data processing happens in place or the data is copied.
Once you’ve extracted the data from your data sources and landed it in the cost-effective cloud blob storage of your data lake, you can develop the transformation code to filter, clean, and aggregate your raw data. Write data processing code in Scala, Java, SQL, Python, or R using the integrated cloud-based collaborative notebooks, or use Databricks Connect to attach your preferred IDE or notebook environment, such as IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio, Zeppelin, Jupyter, or other custom applications. You can also use visual drag-and-drop pipeline builders such as Informatica or Talend to set up the transformation process.
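The transformation step usually follows a filter, clean, aggregate shape regardless of language. A minimal pure-Python sketch of that shape (the PySpark equivalent would use `filter`, `withColumn`, and `groupBy`; the record fields here are invented for illustration):

```python
from collections import defaultdict

raw_events = [
    {"user": "a", "amount": "10.5", "country": "US"},
    {"user": "b", "amount": None,   "country": "US"},   # missing value: dropped
    {"user": "a", "amount": "4.5",  "country": "US"},
    {"user": "c", "amount": "7.0",  "country": "DE"},
]

# Filter: drop records with missing amounts.
filtered = [e for e in raw_events if e["amount"] is not None]

# Clean: cast the string amount to a numeric type.
cleaned = [{**e, "amount": float(e["amount"])} for e in filtered]

# Aggregate: total amount per country.
totals = defaultdict(float)
for e in cleaned:
    totals[e["country"]] += e["amount"]

print(dict(totals))  # {'US': 15.0, 'DE': 7.0}
```

The same three stages map directly onto DataFrame operations once the logic moves into a notebook or IDE-attached Spark session.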
Data reliability is an important issue for data pipelines. Failed jobs can corrupt and duplicate data with partial writes. Multiple data pipelines reading and writing concurrently to your data lake can compromise data integrity. Delta Lake is an open source storage layer for your existing data lake, and uses versioned Apache Parquet™ files and a transaction log to keep track of all data commits and deliver reliability capabilities to Spark. ACID transactions ensure that multiple data pipelines can simultaneously read and write data reliably on the same table. Schema Enforcement ensures data types are correct and required columns are present, and Schema Evolution allows these requirements to change as data changes.
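Conceptually, the transaction log is what turns a pile of files into a reliable table: a write becomes visible only when its commit lands in the log. The toy class below is a pure-Python illustration of that idea (atomic commits, schema enforcement, versioned reads) and is not Delta Lake's actual implementation:

```python
class ToyDeltaTable:
    """Toy model of a transaction-log table: a batch of rows becomes
    visible only when its commit is appended to the log, so a failed
    job that never commits leaves no partial data behind.
    Illustrative only -- not Delta Lake's real code."""

    def __init__(self, schema):
        self.schema = schema          # required column -> expected type
        self.log = []                 # ordered list of committed versions

    def write(self, rows):
        # Schema enforcement: reject bad rows before anything is committed.
        for row in rows:
            for col, typ in self.schema.items():
                if col not in row or not isinstance(row[col], typ):
                    raise ValueError(f"schema violation on column {col!r}")
        # Atomic commit: a single append makes the whole batch visible.
        self.log.append(rows)

    def read(self, version=None):
        # Versioned read: see the table as of any committed version.
        end = len(self.log) if version is None else version + 1
        return [row for commit in self.log[:end] for row in commit]


table = ToyDeltaTable({"id": int, "value": float})
table.write([{"id": 1, "value": 1.5}])           # version 0
table.write([{"id": 2, "value": 2.5}])           # version 1
try:
    table.write([{"id": "bad", "value": 3.0}])   # rejected, nothing committed
except ValueError:
    pass
print(len(table.read()))             # 2 -- the failed write left no trace
print(len(table.read(version=0)))    # 1 -- the table as of version 0
```

Delta Lake applies the same principles to Parquet files in cloud storage, which is what lets many concurrent readers and writers share one table safely.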
Getting started with open source Delta Lake is quick and easy, and you can begin with a simple change in commands to write data into Delta format. Parquet files can be easily converted in place to the open Delta Lake format, and you can easily convert Delta tables back to Parquet as well. Once started with Delta Lake, you can unify batch and stream processing under a single, simplified architecture.
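The "simple change in commands" is essentially swapping the format string, and in-place conversion is one API call. A sketch of the relevant calls (paths are placeholders, and this runs only inside a Spark environment with Delta Lake available, so it is shown here for shape rather than as a runnable script):

```python
# Write in Delta format instead of Parquet: only the format string changes.
df.write.format("delta").save("/data/events")      # previously .format("parquet")

# Convert existing Parquet files in place to the Delta format.
from delta.tables import DeltaTable
DeltaTable.convertToDelta(spark, "parquet.`/data/events_parquet`")
```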
The data reliability guarantees provided by Delta Lake across batch and streaming enable new data architecture patterns. A “medallion” model takes raw data landed from source systems and progressively refines the data through bronze, silver, and gold tables. Streaming data pipelines automatically read and write the data through the different tables, with data reliability ensured by Delta Lake. This results in data continuously flowing through your data lake and providing end users with the most complete, reliable, up-to-date data available.
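The medallion flow can be sketched as three successive refinements. A pure-Python illustration of the pattern (in a real pipeline each stage would be a Delta table fed by a streaming job; the fields are invented):

```python
# Bronze: raw records exactly as landed from the source system.
bronze = [
    {"ts": "2024-01-01", "reading": "21.5"},
    {"ts": "2024-01-01", "reading": "bad"},      # malformed record
    {"ts": "2024-01-02", "reading": "19.5"},
]

def to_silver(rows):
    # Silver: parsed, validated records; malformed rows are filtered out.
    out = []
    for r in rows:
        try:
            out.append({"ts": r["ts"], "reading": float(r["reading"])})
        except ValueError:
            continue
    return out

def to_gold(rows):
    # Gold: business-level aggregate ready for end users.
    return {"avg_reading": sum(r["reading"] for r in rows) / len(rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'avg_reading': 20.5}
```

Because each stage only ever reads committed data from the stage before it, the refinement can run continuously without the stages corrupting each other.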
Process data with the Databricks Runtime, a highly optimized version of Apache Spark tuned by the original creators of the project for up to 50x performance gains. These runtime images bundle specific versions of software, including Ubuntu, Scala, Java, Python, and R, and are continually maintained and updated to support new releases. Easily select the desired Databricks Runtime version for your data processing, or bring your own container image, which can then be run on cloud compute resources on Azure or AWS.
Abstract away infrastructure complexity and resource management. Databricks makes it easy to self-serve compute resources for data processing needs, while maintaining administrative control over usage. Clusters intelligently auto-scale with workloads, auto-terminate with inactivity, and can be sized across a variety of optimized memory and compute configurations. Cluster environments can be managed with performance monitoring and cluster logs. For your clusters, select from different prebuilt Databricks Runtime versions, or bring your own container image for custom environments.
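In practice, autoscaling and auto-termination are just fields on the cluster definition. A sketch of a Clusters-API-style payload (the field names follow the Databricks Clusters API shape, but the node type and runtime version are placeholder values; check the current API reference before use):

```python
import json

# Hypothetical cluster definition: autoscale between 2 and 8 workers and
# shut down after 30 idle minutes. All values are illustrative placeholders.
cluster_spec = {
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",      # a Databricks Runtime image
    "node_type_id": "i3.xlarge",              # memory/compute profile
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
}
print(json.dumps(cluster_spec, indent=2))
```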
With your data processing logic ready and clusters set up, you can configure how your jobs should run: on a set schedule for batch jobs, or continuously with Structured Streaming. You can configure dependencies and job frequency, and review any active or previously completed runs. Jobs can be scheduled against notebooks or custom JARs containing your data processing code. Manage job creation and execution through the main UI, CLI, or API, and set up alerts on job status via email or notification systems such as PagerDuty or Slack.
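A scheduled job is likewise just a declarative definition. A sketch of a Jobs-API-style payload that runs a notebook nightly and emails on failure (field names follow the Databricks Jobs API shape; the notebook path and address are placeholders):

```python
import json

# Hypothetical job definition: run a notebook every day at 02:00 UTC
# and notify by email on failure. Values are illustrative placeholders.
job_spec = {
    "name": "nightly-etl",
    "notebook_task": {"notebook_path": "/pipelines/transform"},
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # daily at 02:00
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["data-team@example.com"]},
}
print(json.dumps(job_spec, indent=2))
```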
Easily connect Databricks to your existing CI/CD processes across development, staging, and production environments. Sync notebooks with your version control system, such as GitHub, to see branches and history for local development. During your build and deployment processes, for example with Jenkins, you can push the release artifact of compiled code and configuration files to blob storage as a JAR file with the Databricks CLI/API, which can then be read by a Databricks workspace. The Databricks API can then be used to update an existing data processing job to point at the new JAR file for the transformation code, or to create a new job scheduled against the release artifact.
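The final deployment step, pointing an existing job at the new artifact, is one API call. A sketch of constructing that call with the standard library (the workspace host, token, job ID, and paths are placeholders, and the request is built but deliberately not sent here):

```python
import json
import urllib.request

# Hypothetical release step: repoint job 123 at the JAR the build pushed
# to DBFS. All identifiers and paths below are placeholders.
payload = {
    "job_id": 123,
    "new_settings": {
        "name": "nightly-etl",
        "spark_jar_task": {"main_class_name": "com.example.Main"},
        "libraries": [{"jar": "dbfs:/releases/etl-1.2.0.jar"}],
    },
}
req = urllib.request.Request(
    "https://<workspace-host>/api/2.1/jobs/reset",
    data=json.dumps(payload).encode(),
    headers={"Authorization": "Bearer <token>",
             "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # would send the update in a real deployment
print(req.get_method())  # POST
```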
CI/CD fits within a larger framework of best practices to improve the processes and reliability of how data pipelines are built, tested, and managed in production with Databricks. Leverage unit tests and integration tests to validate data processing works as intended, and that the overall system connects together. Troubleshoot issues with different logs available for Spark drivers/workers, cluster events, and system output. Monitor production cluster performance with Ganglia and Datadog metrics, and check operational information on the Databricks status page.
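A unit test for transformation logic can be as simple as asserting on a small, hand-built input. A sketch using plain `assert`s (a real suite would typically use pytest and run in CI; the deduplication function is a made-up example of a typical cleanup step):

```python
def dedupe_latest(rows):
    """Keep only the newest record per key -- a common cleanup step."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["updated_at"]):
        latest[row["id"]] = row
    return list(latest.values())

# Unit test: duplicates collapse to the most recent record per id.
rows = [
    {"id": 1, "updated_at": 1, "value": "old"},
    {"id": 1, "updated_at": 2, "value": "new"},
    {"id": 2, "updated_at": 1, "value": "only"},
]
result = dedupe_latest(rows)
assert len(result) == 2
assert {r["value"] for r in result} == {"new", "only"}
print("tests passed")
```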
Healthdirect uses Databricks to process terabytes of data, leveraging fine-grained table features and data versioning to solve duplication and eliminate data redundancy. This has enabled them to deliver high-quality data that improves health services demand forecasting and clinical outcomes in service lines such as Aged Care and Preventative Health.