Apache Spark™ has rapidly emerged as the de facto standard for big data processing and data science across all industries. Its use cases range from providing recommendations based on user behavior to analyzing millions of genomic sequences to accelerate drug innovation and development for personalized medicine.
Our engineers, including the team that started the Spark research project at UC Berkeley, which later became Apache Spark, continue to drive Spark development to make these transformative use cases a reality. Through the Databricks Blog, they regularly highlight new Spark releases and features, provide technical tutorials on Spark components, and share practical implementation tools and tips.
This e-book, the first of a series, offers a collection of the most popular technical blog posts written by leading Spark contributors and members of the Spark PMC, including Matei Zaharia, the creator of the Spark research project at UC Berkeley; Reynold Xin, Spark’s chief architect; Michael Armbrust, the architect behind Spark SQL; Xiangrui Meng and Joseph Bradley, the drivers of Spark MLlib; and Tathagata Das, the lead developer behind Spark Streaming, to name just a few.
These blog posts highlight many of the major developments designed to make Spark analytics simpler. They are organized into the following sections:
- Section 1: An Introduction to the Apache Spark APIs for Analytics
- Section 2: Tips and Tricks in Data Import
- Section 3: Real-World Case Studies of Spark Analytics with Databricks
This e-book includes recently created Databricks notebooks in Python, Scala, SQL, R, and Markdown that will help you experiment with Apache Spark analytics and visualizations. If you do not have access to Databricks, you can sign up for Databricks Community Edition for free.
Whether you are just getting started with Spark or are already a Spark power user, this e-book will arm you with the knowledge to be successful on your next Spark project.