On-Demand Webinar and FAQ: Accelerate Data Science with Better Data Engineering on Databricks
On July 13th, we hosted a live webinar, Accelerate Data Science with Better Data Engineering on Databricks. The webinar focused on using PySpark to transform petabytes of data for ad hoc analysis and downstream queries. In particular, we covered:
- Transforming TBs of data responsibly with RDDs and PySpark
- Using the JDBC connector to write results to production databases seamlessly (see the sketch after this list)
- Comparisons with a similar approach using Hive
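As a rough sketch of the pattern behind the first two bullets: read raw data from S3, reshape it with the DataFrame API (which sits on top of RDDs), and push the results to a production database through the JDBC connector. The bucket path, column names, JDBC URL, table, and credentials below are illustrative placeholders, not the configuration shown in the webinar.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("webinar-etl-sketch").getOrCreate()

# Read raw event logs from S3 (placeholder path).
raw = spark.read.json("s3a://example-bucket/raw/events/")

# Transform with the DataFrame API rather than raw RDDs so Catalyst can
# optimize the plan. We assume the raw records carry "timestamp" and
# "event_type" fields; here we roll them up into daily counts per type.
daily_counts = (
    raw.withColumn("event_date", F.to_date(F.col("timestamp")))
       .groupBy("event_date", "event_type")
       .count()
)

# Write the aggregated results to a production database via the JDBC
# connector (URL, table, user, and password are placeholders).
(daily_counts.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/analytics")
    .option("dbtable", "daily_event_counts")
    .option("user", "etl_user")
    .option("password", "REDACTED")
    .mode("append")
    .save())
```

Keeping the heavy lifting in DataFrames and writing only the small aggregated result over JDBC avoids hammering the production database with bulk traffic.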
If you missed the webinar, you can view it on-demand here; the slides are available as attachments to the webinar.
Toward the end, we held a Q&A session. The questions are listed below; follow each link to a forum thread with its answer.
How often is the raw data deposited onto S3, and how often does your ETL run?
What version of Apache Spark did you use and why?
What does Databricks actually help with? Does it only work with Apache Spark?
In PySpark, how do we differentiate a Dataset from a DataFrame?
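For quick orientation on that last question (the linked thread has the full answer): PySpark does not expose a typed Dataset API; in Scala, a DataFrame is simply an alias for Dataset[Row], and Python, being dynamically typed, works only with the untyped DataFrame. A minimal illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-vs-dataframe").getOrCreate()

# In PySpark, tabular data is always a DataFrame; there is no Dataset class.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
print(type(df))  # <class 'pyspark.sql.dataframe.DataFrame'>
```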
If you’d like free access to Databricks, you can sign up for the free trial here.