On July 13th, we hosted a live webinar — Accelerate Data Science with Better Data Engineering on Databricks. The webinar focused on using PySpark to transform petabytes of data for ad-hoc analysis and downstream queries. In particular, we covered:
- Transforming TBs of data with RDDs and PySpark responsibly
- Using the JDBC connector to write results to production databases seamlessly
- Comparisons with a similar approach using Hive
If you missed the webinar, you can view it on-demand here; the slides are available as attachments to the webinar.
Toward the end, we held a Q&A, and below are all the questions, each linking to a forum thread with its answer.
I heard that RDDs are going to disappear in the next version of Apache Spark and that DataFrames will replace them. Is that true?
Can you walk through the whole pipeline you used to build the recommendation engine while dealing with massive amounts (TBs) of data?
How often is the raw data deposited onto S3, and how frequently does your ETL run?
What version of Apache Spark did you use and why?
Could you please explain a bit more about the four groups involved in the Index calculation (s, G, s.p & G.p)?
What exactly does Databricks help with? Is it only for working with Apache Spark?
Do you think Datasets/DataFrames are faster than RDDs, and could they improve performance in this use case?
In PySpark, how do we differentiate a Dataset from a DataFrame?
If you’d like to try Databricks, you can sign up for the free trial here.