On July 13th, we hosted a live webinar, Accelerate Data Science with Better Data Engineering on Databricks. This webinar focused on the use of PySpark in transforming petabytes of data for ad-hoc analysis and generating downstream queries. In particular, we covered:

  • Transforming TBs of data with RDDs and PySpark responsibly
  • Using the JDBC connector to write results to production databases seamlessly
  • Comparisons with a similar approach using Hive

If you missed the webinar, you can view it on demand here; the slides are available as attachments to the webinar.

Toward the end, we held a Q&A. Below are all the questions, each linking to the forum thread with its answer.

I heard that RDDs are going to disappear in the next version of Apache Spark, and DataFrames will replace RDDs. Is that true?

Is there a full pipeline showing how you were able to build the recommendation engine while dealing with massive amounts (TBs) of data?

How often is the raw data deposited onto S3, and how often does your ETL run?

What version of Apache Spark did you use and why?

Could you please explain a bit more about the 4 groups involved in the Index calculation (s, G, s.p & G.p)?

What is Databricks actually helping with? Does it only work on Apache Spark?

Do you think Datasets/DataFrames are faster than RDDs, and can improve the performance in this case or your use case?

In PySpark, how do we differentiate a Dataset from a DataFrame?

If you’d like free access to Databricks, you can access the free trial here.
