making big data simple

Hosted cloud

Fully managed Spark clusters available in just seconds with a few clicks.

Immediate answers

Built-in applications help you find answers within minutes of connecting to your data sources.

Powered by Spark

An open-source processing engine that combines blazing speed with sophisticated analytics in a single easy-to-use system.

Latest blog posts


Efficient similarity algorithm now in Spark, thanks to Twitter

October 20, 2014

Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its open-source contribution, with permission. The associated pull request is slated for release in Spark 1.2.

Introduction

We are often interested in finding users, hashtags and ads that are very similar to one another, so they may be recommended and shown to users and advertisers. To do this, we must consider many pairs of items, and evaluate how “similar” they are to one another. We call this the “all-pairs similarity” problem, sometimes known as a “similarity join.” We have developed a new efficient algorithm to solve the similarity join called “Dimension Independent Matrix Square using MapReduce,” or DIMSUM for short, which made one of Twitter’s most expensive batch computations 40% more efficient. To describe the problem we’re trying to solve more formally,...
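
As a rough illustration of the "all-pairs similarity" problem named in this excerpt, the sketch below compares every pair of items by brute force using cosine similarity. It is not DIMSUM and not Twitter's or MLlib's implementation: the function names, the threshold parameter, and the toy vectors are all assumptions made for this example. Its cost grows quadratically with the number of items, which is the expense a more efficient algorithm aims to reduce.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def all_pairs_similarity(items, threshold=0.0):
    """Brute-force similarity join (hypothetical helper, not DIMSUM).

    items: dict mapping an item name to its feature vector.
    Returns a dict {(name_a, name_b): similarity} containing every
    unordered pair whose cosine similarity is at least `threshold`.
    """
    result = {}
    for a, b in combinations(sorted(items), 2):
        sim = cosine(items[a], items[b])
        if sim >= threshold:
            result[(a, b)] = sim
    return result


# Toy data: "x" and "y" point in the same direction, "z" is orthogonal.
items = {"x": [1.0, 0.0], "y": [1.0, 0.0], "z": [0.0, 1.0]}
pairs = all_pairs_similarity(items, threshold=0.5)
```

With the toy data above, only the pair of identical vectors clears the 0.5 threshold; the orthogonal item is filtered out.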

Application Spotlight: Tableau Software

October 15, 2014

This post is guest authored by our friends at Tableau Software, whose visual analytics software is now “Certified on Spark.”

Spark – The Next Big Innovation

Once every few years or so, the big data open source community experiences a major innovation that advances the capabilities of data processing frameworks. For many years, MapReduce and the Hadoop open-source platform served as an effective foundation for the distributed processing of large data sets. Then last year, the introduction of YARN provided the resource manager needed to enable interactive workloads, bringing data processing performance to another level. However, as organizations entrust big data platforms to handle more of their critical business information, the volume and variety of data will continue to grow rapidly as will the need for speed to insight and action on that data. As most of the community...

Spark the fastest open source engine for sorting a petabyte

October 10, 2014

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large-scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes. To evaluate these improvements, we decided to participate in the Sort Benchmark. With help from Amazon Web Services, we participated in the Daytona Gray category, an industry benchmark on how fast...

Application Spotlight: Trifacta

October 9, 2014

This post is guest authored by our friends at Trifacta, whose data transformation platform is now “Certified on Spark.” Today we announced v2 of the Trifacta Data Transformation Platform, a release that emphasizes the important role that Hadoop plays in the new big data enterprise architecture. With Trifacta v2 we now support transforming data of all shapes and sizes in Hadoop. This means supporting Hadoop-specific data formats as both inputs and outputs in Trifacta v2, including Avro, ORC and Parquet. It also means intelligently executing data transformation scripts through not only MapReduce, which was available in Trifacta v1, but also Spark. Trifacta v2 has been officially Certified on Spark by Databricks. Our partnership with Databricks brings the performance and flexibility of the Spark data processing engine to the world of data wrangling. It has been...