databricks
making big data simple

Hosted cloud

Fully managed Spark clusters available in just seconds with a few clicks.

Immediate answers

Built-in applications help you find answers within minutes of connecting to your data sources.

Powered by Spark

An open source processing engine that combines blazing speed with sophisticated analytics in a single easy-to-use system.

Latest blog posts

Spark 1.1: MLlib Performance Improvements

September 22, 2014

With an ever-growing community, Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports many new features. We are excited to share some of the performance improvements observed in MLlib since the 1.0 release, and discuss two key contributing factors: torrent broadcast and tree aggregation.

Torrent broadcast

The beauty of Spark as a unified framework is that any improvements made on the core engine come for free in its standard components like MLlib, Spark SQL, Streaming, and GraphX. In Spark 1.1, we changed the default broadcast implementation of Spark from the traditional HttpBroadcast to TorrentBroadcast, a BitTorrent-like protocol that evens out the load among the driver and the executors. When an object is broadcast, the driver divides the serialized object into multiple chunks, and broadcasts the chunks to different executors. Subsequently,...
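
The chunking step described above can be illustrated with a minimal Python sketch. This is not Spark's TorrentBroadcast code: the function names, the use of `pickle`, and the 16-byte chunk size are all assumptions chosen for illustration, and the peer-to-peer exchange of chunks between executors is left out entirely.

```python
import pickle


def split_into_chunks(obj, chunk_size):
    """Serialize obj and cut the bytes into pieces of at most chunk_size bytes.

    Hypothetical helper: stands in for the driver dividing a serialized
    object into multiple chunks before sending them out.
    """
    data = pickle.dumps(obj)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble(chunks):
    """Join the chunks in order and deserialize the original object.

    Hypothetical helper: stands in for an executor that has collected
    every chunk and rebuilds the broadcast value.
    """
    return pickle.loads(b"".join(chunks))


# An illustrative value to broadcast (made-up data, not from the post).
model = {"weights": [0.1, 0.2, 0.3], "intercept": 1.5}

chunks = split_into_chunks(model, chunk_size=16)
restored = reassemble(chunks)
```

Because each chunk is small and independent, different executors can hold different chunks at the same time, which is what lets a BitTorrent-like scheme spread the load instead of serving the whole object from the driver.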

Databricks and O’Reilly Media launch Certification Program for Apache Spark Developers

September 18, 2014

When Databricks was initially founded a little more than a year ago, there was tremendous excitement around Apache Spark, but it was still early days. The project had ~60 contributors over the previous 12 months, and was not yet available commercially. One of our main focus areas since then has been continuing to grow Spark and the community and making it easily accessible for enterprises and users alike. Taking a step back, it’s terrific to see the progress that Spark has made since then. Spark is today the most active open source project in the Big Data ecosystem with over 300 contributors in the last 12 months alone, and is available through several platform vendors, including all of the major Hadoop distributors. The Spark Summit, dedicated to bringing together the Spark community, more than doubled in size a short...

Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark

September 17, 2014

This is a guest post by Nick Pentreath of Graphflow and Kan Zhang of IBM, who contributed Python input/output format support to Spark 1.1. Two powerful features of Apache Spark include its native APIs provided in Scala, Java and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient in the use of Spark even without experience in Scala, and furthermore can leverage the extensive set of third-party libraries available (for example, the many data analysis libraries for Python). Built-in Hadoop support means that Spark can work “out of the box” with any data storage system or format that implements Hadoop’s InputFormat and OutputFormat interfaces, including HDFS, HBase, Cassandra, Elasticsearch, DynamoDB and many others, as well as various data serialization formats such as SequenceFiles, Parquet, Avro, Thrift and...

Spark 1.1: The State of Spark Streaming

September 16, 2014

With Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components – Spark Streaming – and highlight who is using Spark Streaming and why. Spark 1.1 adds several new features to Spark Streaming. In particular, Spark Streaming extends its library of ingestion sources to include Amazon Kinesis, a hosted stream processing engine, as well as to provide high availability for Apache Flume sources. Moreover, Spark 1.1 adds the first of a set of online machine learning algorithms with the introduction of a streaming linear regression. Many organizations have evolved from exploratory, discovery use cases of big data to use cases that require reasoning on data as it arrives in order to make decisions in real time. Spark Streaming enables this category of high-value use cases, providing a system for processing...