Hosted cloud

Fully managed Spark clusters available in just seconds with a few clicks.

Immediate answers

Built-in applications help you find answers within minutes of connecting to your data sources.

Powered by Spark

An open source processing engine that combines blazing speed with sophisticated analytics in a single easy-to-use system.

Latest blog posts

Application Spotlight: Faimdata

October 27, 2014

This post is guest authored by our friends at Faimdata, whose Consumer Data Intelligence Service is now “Certified on Spark.”

Forecasting, Analytics, Intelligence, Machine Learning

Faimdata’s Consumer Data Intelligence Service is a turnkey Big Data solution that provides comprehensive infrastructure and applications to retailers. We help our clients form close connections with their customers and make timely business decisions, using their existing data sources.

The unified data processing pipeline deployed by Faimdata has three core focuses. They are (i) our Personalization Service that identifies the personal preferences and buying behaviors of each individual consumer using recommendation/machine learning algorithms; (ii) our Data Analytic Workbench where clients execute high performance multi-dimensional analytics across all distributed data sources using pre-defined or ad-hoc SQL-like languages; and (iii) our Social Intelligence Engine where clients can monitor social media related to their brands, products and...

Spark Summit East – CFP now open

October 23, 2014

The call for presentations for the inaugural Spark Summit East is now open. Please join us in New York City on March 18-19, 2015 to share your experience with Spark and celebrate its growing community.

Spark Summit East is looking for presenters who would like to showcase how Spark and its related technologies are used in applications, development, data science and research. Please visit our submission page for additional details. The deadline for submissions is December 5, 2014 at 11:59 pm PST.

Spark Summit East is the leading event for Apache Spark users, developers and vendors. It is an exciting opportunity to meet analysts, researchers, developers and executives interested in using Spark technology to answer big data questions. If you missed Spark Summit 2014, all the content is available online for free.

Efficient similarity algorithm now in Spark, thanks to Twitter

October 20, 2014

Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its open-source contribution, with permission. The associated pull request is slated for release in Spark 1.2.

Introduction

We are often interested in finding users, hashtags and ads that are very similar to one another, so they may be recommended and shown to users and advertisers. To do this, we must consider many pairs of items, and evaluate how “similar” they are to one another. We call this the “all-pairs similarity” problem, sometimes known as a “similarity join.”

We have developed a new efficient algorithm to solve the similarity join called “Dimension Independent Matrix Square using MapReduce,” or DIMSUM for short, which made one of Twitter’s most expensive batch computations 40% more efficient. To describe the problem we’re trying to solve more formally,...
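
As a rough illustration of the all-pairs similarity problem described in this post, the sketch below compares every pair of items by brute force in plain Python. It is not DIMSUM and does not use Spark. The choice of cosine similarity, the function names, the threshold parameter, and the toy hashtag vectors are all assumptions made for the example, since the excerpt does not specify them.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def all_pairs_similarity(items, threshold=0.0):
    """Brute-force similarity join (hypothetical helper, not DIMSUM).

    Compares every unordered pair of items and keeps the pairs whose
    similarity is at least `threshold`.
    """
    result = {}
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(items.items()), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            result[(name_a, name_b)] = sim
    return result


# Hypothetical toy data: each hashtag is a vector of per-user usage counts.
items = {
    "#spark": [1.0, 1.0, 0.0],
    "#bigdata": [1.0, 1.0, 0.0],
    "#cats": [0.0, 0.0, 1.0],
}

similar = all_pairs_similarity(items, threshold=0.5)
for pair, sim in similar.items():
    print(pair, round(sim, 3))
```

The number of pairs grows quadratically with the number of items, which is why a brute-force comparison like this becomes expensive at scale and why a more efficient approach is worth having.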

Application Spotlight: Tableau Software

October 15, 2014

This post is guest authored by our friends at Tableau Software, whose visual analytics software is now “Certified on Spark.”

Spark – The Next Big Innovation

Once every few years or so, the big data open source community experiences a major innovation that advances the capabilities of data processing frameworks. For many years, MapReduce and the Hadoop open-source platform served as an effective foundation for the distributed processing of large data sets. Then last year, the introduction of YARN provided the resource manager needed to enable interactive workloads, bringing data processing performance to another level.

However, as organizations entrust big data platforms to handle more of their critical business information, the volume and variety of data will continue to grow rapidly as will the need for speed to insight and action on that data. As most of the community...