Deep Dive into Spark SQL’s Catalyst Optimizer
Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala's pattern matching and quasiquotes) in a novel way to build an extensible query...
Apache Spark Turns Five Years Old!
Today, we’re celebrating an important milestone for the Apache Spark project: it’s now been five years since Spark was first open sourced. When we first decided to release our research code at UC Berkeley, none of us knew how far Spark would make it, but we believed we had built some really neat technology...
Apache Spark: A review of 2014 and looking ahead to 2015 priorities
2014 has been a year of tremendous growth for Apache Spark. It became the most active open source project in the Big Data ecosystem with over 400 contributors, and was adopted by many platform vendors, including all of the major Hadoop distributors. Through our ecosystem of products, partners, and training at Databricks, we also...
“Learning Spark” book available from O’Reilly
Today we are happy to announce that the complete Learning Spark book is available from O’Reilly in e-book form with the print copy expected to be available February 16th. At Databricks, as the creators behind Apache Spark, we have witnessed explosive growth in the interest and adoption of Spark, which has quickly become one of...
The State of Apache Spark in 2014
This post originally appeared in insideBIGDATA and is reposted here with permission. With the second Spark Summit behind us, we wanted to take a look back at our journey since 2009, when Apache Spark, the fast and general engine for large-scale data processing, was initially developed. It has been exciting and extremely gratifying to watch...
Making Apache Spark Easier to Use in Java with Java 8
One of Apache Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s API to transparently support these...
Apache Spark: A Delight for Developers
This article was cross-posted in the Cloudera developer blog. Apache Spark is well known today for its performance benefits over MapReduce, as well as its versatility. However, another important benefit — the elegance of the development experience — gets less mainstream attention. In this post, you’ll learn just a few of the features in Spark...
The Growing Apache Spark Community
This year has seen unprecedented growth in both the user and contributor communities around Apache Spark. This rapid growth validates the tremendous potential of the platform, and shows the great excitement around it. While Spark started as a research project by a few grad students at UC Berkeley in 2009, today over 90 developers from...
Databricks and the Apache Spark Platform
When we announced that the original team behind Apache Spark is starting a company around the project, we got a lot of excited questions. What areas will the company focus on, and what will it mean for the open source project? Today, in our first blog post at Databricks, we’re happy to share some of...