Two months ago, we held a live webinar — Not Your Father’s Database: How to Use Apache Spark Properly in your Big Data Architecture — which covered a series of use cases where you can store your data cheaply in files and analyze the data with Apache Spark, as well as use cases where you want to store your data in a different data source and access it with Spark DataFrames.
The webinar is accessible on-demand. The slides and sample notebooks are also downloadable as attachments to the webinar. Join the Databricks Community Edition beta to get free access to Apache Spark and try out the notebooks.
Below, we have also answered the most common questions raised by webinar viewers. If you have additional questions, please check out the Databricks Forum.
Common webinar questions and answers
Click on a question to see its answer:
- What are the best practices to store files in S3 to enable more efficient Spark access?
- What are the pros and cons between using Spark + HDFS or Spark + S3?
- What is the advantage of Spark ML over a homegrown Python scikit-learn library?
- Would storing data using Parquet solve most problems for query efficiency? How does Spark SQL take advantage of Parquet?