Hosted cloud

Fully managed Spark clusters available in just seconds with a few clicks.

Immediate answers

Built-in applications help you find answers within minutes of connecting to your data sources.

Powered by Spark

An open-source processing engine that combines blazing speed with sophisticated analytics in a single easy-to-use system.

Latest blog posts

Spark Summit East – CFP now open

October 23, 2014

The call for presentations for the inaugural Spark Summit East is now open. Please join us in New York City on March 18-19, 2015 to share your experience with Spark and celebrate its growing community. Spark Summit East is looking for presenters who would like to showcase how Spark and its related technologies are used in applications, development, data science and research. Please visit our submission page for additional details. The deadline for submissions is December 5, 2014 at 11:59 pm PST. Spark Summit East is the leading event for Apache Spark users, developers and vendors. It is an exciting opportunity to meet analysts, researchers, developers and executives interested in using Spark technology to answer big data questions. If you missed Spark Summit 2014, all the content is available online for free.

Efficient similarity algorithm now in Spark, thanks to Twitter

October 20, 2014

Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its open-source contribution, with permission. The associated pull request is slated for release in Spark 1.2.

Introduction

We are often interested in finding users, hashtags and ads that are very similar to one another, so they may be recommended and shown to users and advertisers. To do this, we must consider many pairs of items, and evaluate how “similar” they are to one another. We call this the “all-pairs similarity” problem, sometimes known as a “similarity join.” We have developed a new efficient algorithm to solve the similarity join called “Dimension Independent Matrix Square using MapReduce,” or DIMSUM for short, which made one of Twitter’s most expensive batch computations 40% more efficient. To describe the problem we’re trying to solve more formally,...
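To make the all-pairs similarity problem concrete, here is a minimal Python sketch of the naive baseline: score every pair of items and keep those above a cutoff. This is not DIMSUM and not the MLlib contribution. The function names, the use of cosine similarity, the threshold, and the toy hashtag vectors are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def all_pairs_similarity(items, threshold=0.0):
    """Brute force: score every unordered pair, keep those >= threshold.

    `items` maps a (hypothetical) item name to its feature vector.
    """
    pairs = {}
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(items.items()), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs[(name_a, name_b)] = score
    return pairs


# Toy data (made up): each vector marks which of four users used a hashtag.
items = {
    "#spark": [1, 1, 0, 1],
    "#bigdata": [1, 1, 0, 0],
    "#cats": [0, 0, 1, 0],
}
similar = all_pairs_similarity(items, threshold=0.5)
```

The brute-force loop evaluates every one of the n(n-1)/2 pairs, which is the cost that makes the problem expensive at scale and that an algorithm like DIMSUM is meant to reduce.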

Application Spotlight: Tableau Software

October 15, 2014

This post is guest authored by our friends at Tableau Software, whose visual analytics software is now “Certified on Spark.”

Spark – The Next Big Innovation

Once every few years or so, the big data open source community experiences a major innovation that advances the capabilities of data processing frameworks. For many years, MapReduce and the Hadoop open-source platform served as an effective foundation for the distributed processing of large data sets. Then last year, the introduction of YARN provided the resource manager needed to enable interactive workloads, bringing data processing performance to another level. However, as organizations entrust big data platforms to handle more of their critical business information, the volume and variety of data will continue to grow rapidly, as will the need for speed to insight and action on that data. As most of the community...

Spark the fastest open source engine for sorting a petabyte

October 10, 2014

Apache Spark has seen phenomenal adoption, being widely regarded as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large-scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improving the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes. To evaluate these improvements, we decided to participate in the Sort Benchmark. With help from Amazon Web Services, we participated in the Daytona Gray category, an industry benchmark on how fast...