On June 25th, our team hosted a live webinar — Getting Data Ready for Data Science — with Prakash Chockalingam, Product Manager at Databricks.
Successful data science relies on solid data engineering to furnish reliable data. Data lakes are a key element of modern data architectures. Although data lakes afford significant flexibility, they also face various data reliability challenges. Delta Lake is an open source storage layer that brings reliability to data lakes, letting you provide trustworthy data for data science and analytics. Delta Lake is deployed at nearly a thousand customers and was recently open sourced by Databricks.
The webinar covered modern data engineering in the context of the data science lifecycle and how Delta Lake can help enable your data science initiatives. Topics covered included:
- The data science lifecycle
- The importance of data engineering to successful data science
- Key tenets of modern data engineering
- How Delta Lake can help make reliable data ready for analytics
- The ease of adopting Delta Lake for powering your data lake
- How to incorporate Delta Lake within your data infrastructure to enable Data Science
If you are interested in more technical detail, we encourage you to also check out the webinar “Delta Lake: Open Source Reliability for Data Lakes” by Michael Armbrust, the Principal Engineer responsible for Delta Lake. You can access the Delta Lake code and documentation at the Delta Lake hub.
Toward the end of the webinar, there was time for Q&A. Here are some of the questions and answers.
Q: Is Delta Lake available on Docker?
A: You can download and package Delta Lake as part of your Docker container, and we are aware of some users employing this approach. The Databricks platform also has support for containers. If you use Delta Lake on the Databricks platform, no extra steps are required, since Delta Lake is packaged as part of the platform. If you have custom libraries, you can package them as Docker containers and use them to launch clusters.
Q: Is Delta architecture good for both reads and writes?
A: Yes, the architecture works well for both reads and writes, and it is optimized for throughput in both cases.
Q: Is MERGE available on Delta Lake without Databricks, i.e., in the open source version?
A: While not currently available in the open source version, MERGE is on the roadmap and planned for the next release in July. It is tracked in the GitHub milestones here.
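For context, MERGE performs an upsert: source rows that match the target on a key are updated, and non-matching rows are inserted. Here is a pure-Python sketch of those semantics only (not the Delta Lake API itself, which is invoked through Spark; the `id` key and sample rows are made up for illustration):

```python
def merge_upsert(target, source, key="id"):
    """Upsert `source` rows into `target`, matching on `key`.

    Mirrors MERGE semantics: WHEN MATCHED THEN UPDATE,
    WHEN NOT MATCHED THEN INSERT.
    """
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # matched -> update
        else:
            by_key[row[key]] = dict(row)   # not matched -> insert
    return list(by_key.values())

target = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
source = [{"id": 2, "name": "bobby"}, {"id": 3, "name": "carol"}]
merged = merge_upsert(target, source)
```

In Delta Lake itself the same matched/not-matched clauses are expressed against a table rather than in-memory rows, but the outcome is the same upsert.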
Q: Can you discuss creating a feature engineering pipeline using Delta Lake?
A: Delta Lake can play an important role in your feature engineering pipeline, with schema-on-write helping to ensure that the feature store is of high quality. We are also working on a new feature called Expectations that will further help with managing how tightly constraints are applied to features.
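Schema-on-write means records are validated against the table's schema before they are committed, so malformed rows never land in the feature store. A minimal pure-Python sketch of that kind of check (the feature names and schema here are hypothetical; real enforcement happens inside Delta Lake's write path):

```python
# Hypothetical feature schema: field name -> expected type
EXPECTED_SCHEMA = {"user_id": int, "clicks_7d": int, "ctr_7d": float}

def validate_row(row, schema=EXPECTED_SCHEMA):
    """Reject rows whose fields or types deviate from the schema."""
    if set(row) != set(schema):
        raise ValueError(f"unexpected fields: {set(row) ^ set(schema)}")
    for field, expected_type in schema.items():
        if not isinstance(row[field], expected_type):
            raise ValueError(f"{field}: expected {expected_type.__name__}")
    return row

feature_store = []
feature_store.append(validate_row({"user_id": 7, "clicks_7d": 42, "ctr_7d": 0.08}))

# A row with a mistyped field is rejected before it reaches the store.
try:
    validate_row({"user_id": "7", "clicks_7d": 42, "ctr_7d": 0.08})
    rejected = False
except ValueError:
    rejected = True
```

Because bad rows fail at write time rather than at training time, downstream feature consumers can trust what they read.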
Q: Is there a way to bulk move the data from databases into Delta Lake without creating and managing a message queue?
A: Yes, you can dump the change data to ADLS or S3 directly using connectors like GoldenGate. You can then stream from cloud storage. This eliminates the burden of managing a message queue.
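Once change records land as files in cloud storage, a downstream job can read them in order and apply each operation to the table. A pure-Python sketch of applying a batch of change records (the `op`/`row` record shape is illustrative only; tools like GoldenGate emit their own formats):

```python
def apply_changes(table, changes, key="id"):
    """Apply insert/update/delete change records to a keyed table."""
    state = {row[key]: row for row in table}
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("insert", "update"):
            state[row[key]] = row          # upsert the new row image
        elif op == "delete":
            state.pop(row[key], None)      # drop the row if present
    return list(state.values())

table = [{"id": 1, "balance": 100}]
changes = [
    {"op": "update", "row": {"id": 1, "balance": 150}},
    {"op": "insert", "row": {"id": 2, "balance": 50}},
    {"op": "delete", "row": {"id": 1}},
]
result = apply_changes(table, changes)
```

In the streaming setup described above, each micro-batch of change files would be applied this way as it arrives in ADLS or S3, with no message queue in between.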
Q: Can you discuss the Bronze, Silver, Gold concept as applied to tables?
A: The Bronze, Silver, Gold approach (covered in more detail in an upcoming blog) is a common pattern among our customers: raw data is ingested and then refined successively, to different degrees and for different purposes, until one eventually arrives at the most refined “Gold” tables.
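As a conceptual sketch of that successive refinement, in plain Python rather than Spark and with made-up event data:

```python
# Bronze: raw ingested events, kept as-is, possibly malformed
bronze = [
    {"user": "a", "amount": "10"},
    {"user": "b", "amount": "oops"},  # bad record survives in Bronze
    {"user": "a", "amount": "5"},
]

# Silver: parsed and cleaned -- records that fail validation are dropped
silver = []
for event in bronze:
    try:
        silver.append({"user": event["user"], "amount": int(event["amount"])})
    except ValueError:
        pass  # a real pipeline would quarantine these for inspection

# Gold: business-level aggregate, ready for analytics and reporting
gold = {}
for event in silver:
    gold[event["user"]] = gold.get(event["user"], 0) + event["amount"]
```

In practice each stage is its own Delta Lake table, so every layer is independently queryable and reprocessable.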
Q: Does versioning operate at the file, table, or partition level?
A: Versioning operates at the file level: whenever there are updates, Delta Lake identifies which files changed and maintains the information needed to facilitate Time Travel.