
Apache Spark has seen immense growth over the past several years. The size and scale of this Spark Summit are a true reflection of the innovation after innovation that has made its way into the Apache Spark project. Hundreds of contributors working collectively have made Spark an amazing piece of technology powering thousands of organizations. Databricks has initiated many key efforts in Spark, including Project Tungsten, SparkR, Spark SQL and the DataFrame APIs, and Structured Streaming, and we continue to contribute heavily to the project, both with code and by fostering the community.

While the blistering pace of innovation moves the project forward, it makes keeping up to date with all these improvements challenging. To solve this problem, we are happy to introduce Spark: The Definitive Guide. In partnership with O’Reilly Media, we will publish this new comprehensive book on Spark later this year. To celebrate the largest Spark Summit ever, we are releasing several chapters for free to the community. Additionally, if you use discount code AUTHD on the O'Reilly site, you can get 50% off the ebook and 40% off the print edition!

[Book cover: Early Release of Spark: The Definitive Guide (source: O'Reilly)]

We have strived to write an informative book on Spark, one that condenses the community's development knowledge of Apache Spark for you.

The Parts of the Book

The first few chapters form the “Gentle Introduction to Spark”; the intended audience is anyone from a SQL analyst to a data engineer. This section covers the basic concepts that everyone should understand about Apache Spark and provides a tour of the different aspects of Spark’s ecosystem.

The second part of the book dives into Spark’s Structured APIs, powered by the Catalyst engine. You’ll see everything from reading data sources to DataFrame and Dataset transformations and everything in between, with examples in SQL, Python, and Scala to follow along with.
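To give a flavor of what this part covers, here is a minimal Scala sketch of a Structured API workflow: reading a data source into a DataFrame and chaining a few transformations. The file path and column names are hypothetical, not from the book.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("structured-api-example")
  .getOrCreate()

// Read a CSV data source into a DataFrame, then filter, aggregate, and sort.
val flights = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/flights.csv") // hypothetical path

flights
  .where(col("origin") === "SFO") // hypothetical column names
  .groupBy("dest")
  .count()
  .orderBy(desc("count"))
  .show(5)
```

The same logic can be expressed as a SQL query or in Python with essentially identical structure, which is exactly the point of the Structured APIs.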

To show the foundation that DataFrames are actually built on, the third part of the book discusses Spark’s low-level APIs, including RDDs, for those who need advanced functionality or need to work with legacy code built on RDDs.
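As a rough illustration of the kind of low-level code this part addresses, here is a small Scala sketch that drops down to the RDD API; the word list is made up for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-example")
  .getOrCreate()

// Build an RDD from a local collection and apply low-level transformations
// (map, reduceByKey) directly, without going through the DataFrame API.
val words = spark.sparkContext.parallelize(
  Seq("spark", "rdd", "spark", "dataframe", "rdd", "spark"))

val counts = words
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)
```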

The fourth part of the book is a deep dive into how Spark actually runs on a cluster and discusses options for optimization, monitoring, tuning, and debugging.
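As one small example of the kind of knobs this part discusses, the sketch below sets a couple of common Spark configuration properties when building a SparkSession. The values shown are illustrative only, not tuning recommendations.

```scala
import org.apache.spark.sql.SparkSession

// The values below are illustrative, not tuning recommendations.
val spark = SparkSession.builder()
  .appName("tuning-example")
  .config("spark.sql.shuffle.partitions", "200") // partitions used for shuffles
  .config("spark.executor.memory", "4g")         // memory requested per executor
  .getOrCreate()

// While the application runs, the Spark UI (port 4040 on the driver by default)
// shows jobs, stages, tasks, and storage, and is the usual starting point for
// monitoring and debugging.
```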

Finally, the fifth and sixth parts are deep dives into Structured Streaming and Machine Learning, respectively. We discuss what makes Structured Streaming such a powerful paradigm and the many tools and algorithms that Spark makes available to end users through MLlib, Spark’s Machine Learning library. We even include sections on Graph Analysis with GraphFrames and Deep Learning with TensorFrames.
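To hint at why Structured Streaming feels like such a natural extension of the Structured APIs, here is a minimal Scala sketch that treats a directory of arriving JSON files as an unbounded table and keeps a continuously updated aggregation. The path and schema are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("streaming-example")
  .getOrCreate()

// Schema of the incoming JSON records (hypothetical).
val schema = new StructType()
  .add("user", StringType)
  .add("event", StringType)
  .add("timestamp", TimestampType)

// Treat a directory of arriving JSON files as an unbounded table and keep a
// running count per event type, printed to the console as new data arrives.
val events = spark.readStream
  .schema(schema)
  .json("/data/events/") // hypothetical path

val query = events
  .groupBy("event")
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

The streaming query above is written with the same DataFrame operations used on static data, which is the core idea the Structured Streaming chapters explore.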

The last part of the book looks at the ecosystem more generally: how Spark works with different languages, the wider ecosystem of projects, and the vast community around Spark.

Getting Started

To give you a preview of the book, we are providing a sample of its contents for anyone to download and read, free of charge. This is an unedited sample of the current draft of the Definitive Guide.

We also plan on adding much of this content to the Databricks Documentation so that Databricks customers always have an up-to-date reference. We already include extensive notebook examples that you can use to get started right away, and we will continue to add to these as we finish the book.