Hosted cloud

Fully managed Spark clusters available in just seconds with a few clicks.

Immediate answers

Built-in applications help you find answers within minutes of connecting to your data sources.

Spark from its creators

An open source engine that combines blazing speed with sophisticated analytics in a single easy-to-use system.

Latest blog posts

Introducing streaming k-means in Spark 1.2

January 28, 2015

Much real-world data is acquired sequentially over time, whether messages from social media users, time series from wearable sensors, or (in a case we are particularly excited about) the firing of large populations of neurons. In these settings, rather than waiting for all the data to be acquired before performing our analyses, we can use streaming algorithms to identify patterns over time, and make more targeted predictions and decisions. One simple strategy is to build machine learning models on static data, and then use the learned model to make predictions on an incoming data stream. But what if the patterns in the data are themselves dynamic? That’s where streaming algorithms come in. A key advantage of Spark is that its machine learning library (MLlib) and its library for stream processing (Spark Streaming) are built on the...

Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction!

January 27, 2015

In partnership with Typesafe, we are excited to see the publication of the survey report representing the largest poll of Spark developers to date. Spark is currently the most active open source project in big data and has been rapidly gaining traction over the past few years. This survey of over 2,100 respondents further validates the wide variety of use cases and environments where it is being deployed. The survey results indicate that 13% of respondents already use Spark in production environments, 20% plan to deploy Spark in production environments in 2015, and 31% are currently evaluating it. In total, the survey covers over 500 enterprises that are using or planning to use Spark in production environments ranging from on-premises Hadoop clusters to public clouds, with data sources including key-value stores,...

Random Forests and Boosting in MLlib

January 21, 2015

This is a post written together with Manish Amde from Origami Logic. Spark 1.2 introduces Random Forests and Gradient-Boosted Trees (GBTs) into MLlib. Suitable for both classification and regression, they are among the most successful and widely deployed machine learning methods. Random Forests and GBTs are ensemble learning algorithms, which combine multiple decision trees to produce even more powerful models. In this post, we describe these models and the distributed implementation in MLlib. We also present simple examples and provide pointers on how to get started.

Ensemble Methods

Simply put, ensemble learning algorithms build upon other machine learning methods by combining models. The combination can be more powerful and accurate than any of the individual models. In MLlib 1.2, we use Decision Trees as the base models. We provide two ensemble methods: Random Forests and Gradient-Boosted Trees (GBTs). The...

Spark Summit East 2015 Agenda is Now Available

January 20, 2015

We are thrilled to announce the availability of the agenda for Spark Summit East 2015! This inaugural New York City event on March 18-19, 2015, has over thirty jam-packed sessions, offering a combination of longer deep-dive presentations and shorter intensive talks. You will have the opportunity to engage the speakers and your peers in discussion and a cross-pollination of ideas. Want to guarantee your seat now? Don’t forget to register at Spark Summit East. We also have a limited number of rooms at The Sheraton New York Times Square Hotel for our special room rate of $299 before tax. Book now to take advantage of this great low rate! Looking forward to seeing you in New York City in March!

Quick Links: Event registration, Hotel Booking, Conference Agenda

Important Dates: February 16, 2015: Last day...