In this demo, we walk through a high-level overview of the Databricks unified data analytics platform, including how open source projects such as Apache Spark™, Delta Lake, MLflow, and Koalas fit into the Databricks ecosystem. We then cover the Data Science Workspace, launching Spark clusters, and collaborative notebook features, before shifting our focus to Delta Lake, time travel, and SQL Analytics.
Databricks provides a unified, open platform for all your data. It empowers data scientists, data engineers, and data analysts with a simple collaborative environment to run interactive and scheduled data analysis workloads.
Databricks is from the original creators of some of the world’s most popular open source projects: Apache Spark™, Delta Lake, MLflow, and Koalas. It builds on these technologies to deliver a true lakehouse architecture, combining the best of data lakes and data warehouses for a fast, scalable, and reliable data platform.
Databricks is built for the cloud: your data is stored in low-cost cloud object stores such as AWS S3, Azure Data Lake Storage, and Google Cloud Storage, with performant access enabled through caching, optimized data layout, and other techniques.
To work with your data, you can launch clusters with hundreds of machines, each with the mix of CPUs and GPUs your analysis needs. If you’re on a large data team, policies can define limits on cluster sizes and configuration. There is a Databricks Runtime for data engineers and data scientists, as well as a Runtime optimized for machine learning workloads. See how easy it is to create a cluster with up to 390 workers.
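For anyone who would rather script this than click through the UI, here is a rough sketch of what an autoscaling cluster request might look like. The field names (cluster_name, spark_version, node_type_id, autoscale) follow the Databricks Clusters REST API as we understand it; the runtime version, node type, cluster name, and the build_cluster_spec helper itself are placeholders invented for illustration. The size check only mimics the idea of a policy limit locally and is not how Databricks enforces cluster policies.

```python
from typing import Any, Dict

# Worker ceiling mentioned in the demo; a real limit would come from your cluster policy.
MAX_WORKERS_LIMIT = 390


def build_cluster_spec(
    name: str,
    min_workers: int,
    max_workers: int,
    limit: int = MAX_WORKERS_LIMIT,
) -> Dict[str, Any]:
    """Build a request body for an autoscaling cluster (hypothetical helper)."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    if max_workers > limit:
        raise ValueError(f"max_workers {max_workers} exceeds the limit of {limit}")
    return {
        "cluster_name": name,
        "spark_version": "7.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",         # placeholder node type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


spec = build_cluster_spec("demo-cluster", min_workers=2, max_workers=390)
```

The resulting dictionary could be serialized to JSON and sent to the cluster creation endpoint of your workspace; that call is left out here because it needs a workspace URL and a token.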
In the Data Science Workspace, you can create collaborative notebooks using Python, SQL, Scala, or R.
Just like you can share your Google Docs with your colleagues and groups of colleagues, you can also share these notebooks. Plus, built-in commenting tied to your code helps you exchange ideas and updates with your colleagues.
In addition to using notebooks for exploratory data analysis, as you see here, many Databricks users love the powerful integration with machine learning frameworks like MLflow. Here, we’re training a model and testing it. But we can also look up at the top here and see the MLflow Experiment tracking, which allows us to track the previous experiment runs and see how important metrics like accuracy changed over time.
Now MLflow is just one of the integrations that Databricks provides with popular frameworks for machine learning and data science. Databricks also supports a variety of other open source libraries, which are popular in the community.
Want to know more about what data your colleagues have shared with you? Take a look at the Data tab where you can see individual tables with schema and sample data. Importantly, you see the history of operations performed on each table, a.k.a. the transaction log. Now why does history matter? Well, it’s important for compliance and security audits in many industries. But it also enables you to explore your data by another dimension: time. Let’s see how, by opening up the SQL Analytics interface.
The SQL Analytics interface gives us the ability to create visualizations and dashboards, as well as query our lakehouse with performance comparable to or exceeding that of traditional data warehouses. We achieve this level of performance, reliability, schema enforcement, and scale through advances in Delta Lake and Delta Engine. Delta Lake is an open-format storage layer built on top of Apache Parquet, which adds ACID transactions to your cloud data lake. Let’s show you how the transaction log enables Delta Lake time travel.
Here we’re looking at a series of loan risk scores based on where a property is located. When we originally created this data set in version zero, we didn’t have any data for Iowa. We didn’t have any loan applications there. But as time went on, and we’ve reached version 14, you can see that the loan risk score was added for Iowa.
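To picture how a log of commits makes this kind of version lookup possible, here is a toy model in plain Python. It is only a sketch of the general idea of replaying an append-only log up to a chosen version. It is not Delta Lake's actual implementation or file format, and the ToyVersionedTable class and the risk scores are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyVersionedTable:
    """Toy append-only commit log; NOT how Delta Lake is implemented."""

    log: List[Dict[str, float]] = field(default_factory=list)

    def commit(self, upserts: Dict[str, float]) -> int:
        """Append one commit of state -> score upserts and return its version number."""
        self.log.append(dict(upserts))
        return len(self.log) - 1

    def snapshot(self, version: Optional[int] = None) -> Dict[str, float]:
        """Rebuild the table as of `version` by replaying commits 0..version."""
        if version is None:
            version = len(self.log) - 1  # default to the latest version
        if not 0 <= version < len(self.log):
            raise ValueError(f"unknown version {version}")
        state: Dict[str, float] = {}
        for entry in self.log[: version + 1]:
            state.update(entry)
        return state


table = ToyVersionedTable()
v0 = table.commit({"CA": 0.42, "NY": 0.37})  # made-up scores, no Iowa yet
v1 = table.commit({"IA": 0.29})              # Iowa arrives in a later commit

before = table.snapshot(v0)  # the table as it looked at version 0
latest = table.snapshot()    # the table as it looks now
```

Because nothing in the log is ever overwritten, asking for an old version is just a matter of stopping the replay early, which is why Iowa is missing from the version 0 snapshot and present in the latest one.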
Now let’s show you the SQL that powers these queries. And here you can see that we very simply have added a version number into our SQL query to indicate which point in the table’s history we’re querying. This is how we use the Delta Lake time travel feature in order to find the data at a particular point in time.
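If you want to try a version-pinned query like this yourself, it might look roughly like the sketch below. The VERSION AS OF clause is Delta Lake's SQL syntax for time travel; the table name loan_by_state and the time_travel_query helper are hypothetical, since the demo does not show the exact query text.

```python
import re

# Accept plain or dotted identifiers only, e.g. "loans" or "demo_db.loans".
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def time_travel_query(table: str, version: int) -> str:
    """Build a query that reads `table` as of a given Delta Lake version."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"SELECT * FROM {table} VERSION AS OF {version}"


original = time_travel_query("loan_by_state", 0)   # hypothetical table name
current = time_travel_query("loan_by_state", 14)

# In a notebook with a Spark session and Delta Lake available, these strings
# could be run with spark.sql(original) and spark.sql(current).
```

Comparing the results of the two queries side by side is one way to see a change like the Iowa row appearing between version 0 and version 14.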
Now, what if you’ve just started filling your data lake? Well, the Databricks Ingest feature allows you to easily load data into your lakehouse to enable BI and ML. So, it’s really easy to get started today.
Well, I hope you’ve seen how simple and powerful Databricks can be for your entire data team. Whether data analysts, data engineers, or data scientists, they can collaborate to do their data plus AI on Databricks. Learn more at databricks.com.