September 22, 2014

With an ever-growing community, Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports many new features. We are excited to share some of the performance improvements observed in MLlib since the 1.0 release, and to discuss two key contributing factors: torrent broadcast and tree aggregation.

Torrent broadcast

The beauty of Spark as a unified framework is that any improvement made to the core engine comes for free in its standard components such as MLlib, Spark SQL, Streaming, and GraphX. In Spark 1.1, we changed Spark's default broadcast implementation from the traditional HttpBroadcast to TorrentBroadcast, a BitTorrent-like protocol that evens out the load among the driver and the executors. When an object is broadcast, the driver divides the serialized object into multiple chunks and broadcasts the chunks to different executors. Subsequently,...
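The chunking step described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Spark's actual implementation: the helper names and the tiny chunk size are made up for the example (Spark's real broadcast blocks are on the order of megabytes, and chunks are exchanged between executors rather than reassembled locally).

```python
import pickle

CHUNK_SIZE = 4  # bytes per chunk; illustrative only -- real blocks are far larger

def divide_into_chunks(obj, chunk_size=CHUNK_SIZE):
    """Serialize an object and split the bytes into fixed-size chunks,
    mimicking how the driver divides a broadcast value into blocks."""
    data = pickle.dumps(obj)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def reassemble(chunks):
    """An executor that has fetched every chunk (from the driver or from
    peer executors) concatenates and deserializes them."""
    return pickle.loads(b"".join(chunks))

obj = {"weights": [0.1, 0.2, 0.3]}
chunks = divide_into_chunks(obj)
assert reassemble(chunks) == obj
```

Because each executor can serve chunks it has already fetched, the driver is no longer the single source for every byte, which is what evens out the load.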
September 18, 2014

When Databricks was founded a little more than a year ago, there was tremendous excitement around Apache Spark, but it was still early days. The project had ~60 contributors over the previous 12 months and was not yet available commercially. One of our main focus areas since then has been continuing to grow Spark and its community, and making it easily accessible for enterprises and users alike. Taking a step back, it's terrific to see the progress that Spark has made since then. Spark is today the most active open source project in the Big Data ecosystem, with over 300 contributors in the last 12 months alone, and it is available through several platform vendors, including all of the major Hadoop distributors. The Spark Summit, dedicated to bringing together the Spark community, more than doubled in size a short...
September 17, 2014

This is a guest post by Nick Pentreath of Graphflow and Kan Zhang of IBM, who contributed Python input/output format support to Spark 1.1. Two powerful features of Apache Spark are its native APIs in Scala, Java, and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient with Spark even without experience in Scala, and can furthermore leverage the extensive set of third-party libraries available (for example, the many data analysis libraries for Python). Built-in Hadoop support means that Spark works "out of the box" with any data storage system or format that implements Hadoop's InputFormat and OutputFormat interfaces, including HDFS, HBase, Cassandra, Elasticsearch, DynamoDB and many others, as well as various data serialization formats such as SequenceFiles, Parquet, Avro, Thrift and...
September 16, 2014

With Spark 1.1 recently released, we'd like to take this occasion to feature one of the most popular Spark components – Spark Streaming – and highlight who is using Spark Streaming and why. Spark 1.1 adds several new features to Spark Streaming. In particular, Spark Streaming extends its library of ingestion sources to include Amazon Kinesis, a hosted stream processing engine, and provides high availability for Apache Flume sources. Moreover, Spark 1.1 adds the first of a set of online machine learning algorithms with the introduction of streaming linear regression. Many organizations have evolved from exploratory, discovery use cases of big data to use cases that require reasoning on data as it arrives in order to make decisions in real time. Spark Streaming enables this category of high-value use cases, providing a system for processing...
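The idea behind streaming linear regression is that the model is updated incrementally as each mini-batch of data arrives, rather than refit from scratch. The toy sketch below shows that online-SGD update in plain Python; it is a conceptual illustration with made-up names and data, not MLlib's streaming API.

```python
def sgd_update(weights, batch, lr=0.1):
    """One stochastic-gradient step on a mini-batch of (features, label)
    pairs for a linear model y ~ w . x under squared-error loss."""
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights

# Simulate a stream of mini-batches drawn from the relationship y = 2*x.
weights = [0.0]
stream = [[([1.0], 2.0)], [([2.0], 4.0)], [([3.0], 6.0)]] * 20
for batch in stream:
    weights = sgd_update(weights, batch)
# weights[0] converges toward the true coefficient 2.0
```

In the streaming setting, each call to `sgd_update` corresponds to one batch interval: the model reflects all data seen so far without storing it, which is what makes online learning a natural fit for Spark Streaming's micro-batch model.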