On July 13th, we hosted a live webinar — Accelerate Data Science with Better Data Engineering on Databricks. This webinar focused on the use of PySpark in transforming petabytes of data for ad-hoc analysis and generating downstream queries. In particular, we covered:
If you missed the webinar, you can view it on-demand here, and the slides are available as attachments to the webinar.
Toward the end, we held a Q&A, and below are the questions along with links to forum posts containing their answers. (Follow each link to view the answer.)
How often is the raw data deposited onto S3, and how frequently does your ETL run?
What version of Apache Spark did you use and why?
What does Databricks actually help with? Is it only for working with Apache Spark?
In PySpark, how do we differentiate a Dataset from a DataFrame?
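As context for that last question, here is a minimal sketch (our own illustrative example, not from the webinar): in PySpark, the structured API is the DataFrame; typed Datasets exist only in the Scala and Java APIs, where a DataFrame is simply an alias for Dataset[Row].

```python
from pyspark.sql import SparkSession

# Hypothetical app name used only for this illustration.
spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Create a DataFrame from an in-memory list; this is the Python-side
# equivalent of an untyped Dataset[Row] in the Scala API.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],
    schema=["name", "age"],
)

df.printSchema()
df.filter(df.age > 40).show()
```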
If you’d like free access to Databricks, you can sign up for the free trial here.