On-Demand Webinar and FAQ: Accelerate Data Science with Better Data Engineering on Databricks

Published: July 31, 2017

by Andrew Candela, Senior Data Engineer at MediaMath and Jules Damji

On July 13th, we hosted a live webinar, Accelerate Data Science with Better Data Engineering on Databricks. The webinar focused on using PySpark to transform petabytes of data for ad-hoc analysis and to generate downstream queries. In particular, we covered:

Transforming TBs of data with RDDs and PySpark responsibly
Using the JDBC connector to write results to production databases seamlessly
Comparisons with a similar approach using Hive

If you missed the webinar, you can view it on demand; the slides are available as attachments to the webinar.

Toward the end, we held a Q&A. All the questions are listed below; each links to a forum thread with its answer.

I heard that RDDs are going to disappear in the next version of Apache Spark, and DataFrames will replace RDDs. Is that true?

Is there a complete pipeline showing how you built the recommendation engine while dealing with massive amounts (TBs) of data?

How often is the raw data deposited onto S3, and how often does your ETL run?

What version of Apache Spark did you use and why?

Could you please explain a bit more about the 4 groups involved in the Index calculation (s, G, s.p & G.p)?

What does Databricks actually help with? Does it only work with Apache Spark?

Do you think Datasets/DataFrames are faster than RDDs, and can improve the performance in this case or your use case?

In PySpark, how do we differentiate a Dataset from a DataFrame?

If you’d like to try Databricks for free, you can sign up for the free trial.
