Spark Summit East – CFP now open

The call for presentations for the inaugural Spark Summit East is now open. Please join us in New York City on March 18-19, 2015 to share your experience with Spark and celebrate its growing community. Spark Summit East is looking for presenters who would like to showcase how Spark...

Efficient similarity algorithm now in Spark, thanks to Twitter

Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its open-source contribution, with permission. The associated pull request is slated for release in Spark 1.2. Introduction We are often interested in finding users, hashtags and ads that are very similar to...

Application Spotlight: Tableau Software

This post is guest authored by our friends at Tableau Software, whose visual analytics software is now “Certified on Spark.” Spark – The Next Big Innovation Once every few years or so, the big data open source community experiences a major innovation that advances the capabilities of data processing...

Spark the fastest open source engine for sorting a petabyte

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that...

Application Spotlight: Trifacta

This post is guest authored by our friends at Trifacta after having their data transformation platform “Certified on Spark.” Today we announced v2 of the Trifacta Data Transformation Platform, a release that emphasizes the important role that Hadoop plays in the new big data enterprise architecture. With Trifacta v2...

Sharethrough Uses Spark Streaming to Optimize Advertisers’ Return on Marketing Investment

This is a guest blog post from our friends at Sharethrough providing an update on how their use of Spark has continued to expand. Business Challenge Sharethrough is an advertising technology company that provides native, in-feed advertising software to publishers and advertisers. Native, in-feed ads are designed to match...

Spark as a platform for large-scale neuroscience

The brain is the most complicated organ of the body, and probably one of the most complicated structures in the universe. Its millions of neurons somehow work together to endow organisms with the extraordinary ability to interact with the world around them. Things our brains control effortlessly — kicking...

Scalable Decision Trees in MLlib

This is a post written together with one of our friends at Origami Logic. Origami Logic provides a Marketing Intelligence Platform that uses Spark for heavy lifting analytics work on the backend. Decision trees and their ensembles are industry workhorses for the machine learning tasks of classification and regression....

Apache Spark Improves the Economics of Video Distribution at NBC Universal

This is a guest blog post from our friends at NBC Universal outlining their Spark use case. Business Challenge NBC Universal is one of the world’s largest media and entertainment companies with revenues of US$ 26 billion. It operates television networks, cable channels, motion picture and television production companies...

Databricks Reference Applications

At Databricks, we are often asked how to go beyond the basic Spark tutorials and start building real applications with Spark.  As a result, we are developing reference applications on github to demonstrate that.  We believe this is a great way to learn Spark, and we plan on incorporating...

Spark 1.1: MLlib Performance Improvements

With an ever-growing community, Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports many new features. We are excited to share some of the performance improvements observed in MLlib since the 1.0 release, and discuss two key contributing factors: torrent broadcast...

Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark

This is a guest post by Nick Pentreath of Graphflow and Kan Zhang of IBM, who contributed Python input/output format support to Spark 1.1. Two powerful features of Apache Spark include its native APIs provided in Scala, Java and Python, and its compatibility with any Hadoop-based input or output...

Spark 1.1: The State of Spark Streaming

With Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components – Spark Streaming – and highlight who is using Spark Streaming and why. Spark 1.1 adds several new features to Spark Streaming. In particular, Spark Streaming extends its library...

Application Spotlight: Talend

This post is guest authored by our friends at Talend after having Talend Studio “Certified on Spark.” As the move to the next generation of integration platforms gains momentum, the need to implement a proven and scalable technology is critical. Databricks and Spark, delivered on the major Hadoop distributions,...

Announcing Spark 1.1

Today we’re thrilled to announce the release of Spark 1.1! Spark 1.1 introduces many new features along with scale and stability improvements. This post will introduce some key features of Spark 1.1 and provide context on the priorities of Spark for this and the next release. In the next...

Statistics Functionality in Spark 1.1

One of our philosophies in Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of...

Mining Ecommerce Graph Data with Spark at Alibaba Taobao

This is a guest blog post from our friends at Alibaba Taobao. Alibaba Taobao operates one of the world’s largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Spark to analyze these enormous amounts of data. Alibaba Taobao probably runs some of the...

When Stratio Met Spark: A True Love Story

This is a guest post from our friends at Stratio announcing that their platform is now a “Certified Spark Distribution”. Certified distribution Stratio is delighted to announce that it is officially a Certified Spark Distribution. The certification is very important for us because we deeply believe that the certification...

Scalable Collaborative Filtering with Spark MLlib

Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company’s customer...

Spark Summit 2014 Highlights

From June 30 to July 2, 2014 we held the second Spark Summit, a conference focused on promoting the adoption and growth of Apache Spark. This was an exciting year for the Spark community and we are proud to share some highlights. 1,164 participants from over 453 companies attended Spark...

Distributing the Singular Value Decomposition with Spark

Guest post by Li Pu from Twitter and Reza Zadeh from Databricks on their recent contribution to Spark’s machine learning library. The Singular Value Decomposition (SVD) is one of the cornerstones of linear algebra and has widespread application in many real-world modeling situations. Problems such as recommender systems, linear...

The State of Apache Spark in 2014

This post originally appeared in insideBIGDATA and is reposted here with permission. With the second Spark Summit behind us, we wanted to take a look back at our journey since 2009 when Spark, the fast and general engine for large-scale data processing, was initially developed. It has been exciting...

New Features in MLlib in Spark 1.0

MLlib is a Spark component focusing on machine learning. It became a standard component of Spark in version 0.8 (Sep 2013). The initial contribution was from Berkeley AMPLab. Since then, 50+ developers from the open source community have contributed to its codebase. With the release of Spark 1.0, I’m glad...

Databricks Cloud: Making Big Data Easy

Our vision at Databricks is to make big data easy so that we enable every organization to turn its data into value. At Spark Summit 2014, we were very excited to unveil Databricks Cloud, our first product towards fulfilling this vision. In this post, I’ll briefly go over the...

Shark, Spark SQL, Hive on Spark, and the future of SQL on Spark

With the introduction of Spark SQL and the new Hive on Spark effort (HIVE-7292), we get asked a lot about our position in these two projects and how they relate to Shark. At the Spark Summit today, we announced that we are ending development of Shark and will focus...

Integrating Spark and HANA

This morning SAP released its own “Certified Spark Distribution” as part of a brand new partnership announced between Databricks and SAP. We’re thrilled to be embarking on this journey with them, not just because of what it means for Databricks as a company, but just as importantly because of...

Databricks Announces Partnership with SAP

SAN FRANCISCO — July 1, 2014 — Databricks, the company founded by the creators of Apache Spark – the popular open-source processing engine – today announced a new partnership with SAP (NYSE: SAP) to deliver a Databricks-certified Apache Spark distribution offering for the SAP HANA® platform. The full...

Databricks Unveils Spark-Based Cloud Platform; Announces Series B Funding

Databricks Cloud Allows Users to Get Value from Spark without the Challenges Normally Associated with Big Data Infrastructure Ease-of-Use of Turnkey Solution Brings the Power of Spark to a Wider Audience and Fuels the Growth of the Spark Ecosystem Funding Led by NEA with Follow-on Investment from Andreessen Horowitz...

Sparkling Water = H2O + Spark

This post is guest authored by our friends at 0xData discussing the release of Sparkling Water – the integration of their H2O offering with the Spark platform. H2O – The Killer-App on Spark In-memory big data has come of age. The Spark platform, with its elegant API, provides a...

Application Spotlight: Pentaho

This post is guest authored by our friends at Pentaho after having their data integration and analytics platform “Certified on Spark.” Spark on Fire! Integrating Pentaho and Spark to Deliver Next-generation Big Data Analytic Solutions One of Pentaho’s great passions is to empower organizations to take advantage of amazing...

Application Spotlight: Elasticsearch

This post is guest authored by our friends at Elasticsearch announcing Elasticsearch is now “Certified on Spark”, the first step in a collaboration to provide tighter integration between Elasticsearch and Spark. Elasticsearch Now “Certified on Spark” Helping businesses get insights out of their data, fast, is core to the...

Databricks Launches “Certified Spark Distribution” Program

Certified distributions maintain compatibility with open source Apache Spark distribution and thus support the growing ecosystem of Spark applications BERKELEY, Calif. — June 26, 2014 – Databricks, the company founded by the creators of Apache Spark, the next generation Big Data engine, today announced the “Certified Spark Distribution” program...

Application Spotlight: Qlik

This post is guest authored by our friends at Qlik describing how Spark enables the full power of QlikView, recently Certified on Spark, and its Associative Experience feature over the entire HDFS data set. The Power of Qlik Qlik provides software and services that help make understanding data a...

Application Spotlight: Apervi

This post is guest authored by our friends at Apervi after having their Conflux Director™ application be “Certified on Spark”. Big Data on Steroids with Spark As big data takes center stage in the new data explosion, Hadoop has emerged as one of the leading technologies addressing the challenges in...

Application Spotlight: Typesafe

This post is guest authored by our friends at Typesafe after having their Typesafe Activator Spark templates be “Certified on Spark”. Apache Spark and the Typesafe Reactive Platform: A Match Made in Heaven When I started working with Hadoop several years ago, it was frustrating to find that writing...

Spark Summit 2014 Brings Together Apache Spark Community

Three-Day Event in San Francisco Invites Attendees to Gain Insights from the Leading Organizations in Big Data Keynote Speakers Include Executives from Databricks, Cloudera, MapR, DataStax, Jawbone and More Spark Summit Features Different Tracks for Applications, Development, Data Science and Research   BERKELEY, Calif.–(BUSINESS WIRE)– Databricks and the sponsors...

Application Spotlight: Adatao

This post is guest authored by our friends at Adatao describing why and how they bet on Spark. In early 2012, a group of engineers with background in distributed systems and machine learning came together to form Adatao (a-’DAY-tao). We saw a major unsolved problem in the nascent Hadoop...

MicroStrategy “Certified on Spark”

This post is guest authored by our friends at MicroStrategy describing why they’re excited to have their platform “Certified on Spark”. The Need for Speed Over the past few years, we have seen Hadoop emerge as an effective foundation for many organizations’ big data management frameworks, but as the...

Exciting Performance Improvements on the Horizon for Spark SQL

With Apache Spark 1.0 out the door, we’d like to give a preview of the next major initiatives in the Spark project. Today, the most active component of Spark is Spark SQL – a tightly integrated relational engine that inter-operates with the core Spark API. Spark SQL was released...

Databricks Announces Spark Training Workshops

Databricks is excited to launch its training program, starting with a series of hands-on Spark workshops designed by the creators and maintainers of Apache Spark. The first workshop, Introduction to Apache Spark, establishes the fundamentals of using Spark for data exploration, analysis, and building big data applications. This one-day workshop is hands-on, covering topics such as: interactively working with Spark’s core...

Announcing Spark 1.0

Today, we’re very proud to announce the release of Apache Spark 1.0. Spark 1.0 is a major milestone for the Spark project that brings both numerous new features and strong API compatibility guarantees. The release is also a huge milestone for the Spark developer community: with more than 110...

Pivotal Hadoop Integrates the Full Apache Spark Stack

This post is guest authored by our friends at Pivotal describing why they’re excited to deliver Apache Spark on their world class Pivotal HD big data analytics platform suite. Today, we are excited to announce the immediate availability of the full Apache Spark stack on Pivotal HD. We have...

Application Spotlight: Atigeo xPatterns

This post is guest authored by our friends at Atigeo announcing the certification of their xPatterns offering. Here at Atigeo, we are always looking for ways to build on, improve, and expand our big data analytics platform, Atigeo xPatterns. More than that, both our development and product management team...

Databricks and Datastax

Today, Datastax and Databricks announced a partnership in which Apache Spark becomes an integral part of the Datastax offering, tightly integrated with Cassandra. We’re very excited to be embarking on this journey with Datastax for a multitude of reasons: Integrating operational systems with analytics One of the use cases...

Databricks Partners with Simba to Deliver Shark ODBC Driver

VANCOUVER, BC. – April 30, 2014 – Simba Technologies Inc., the industry’s expert for Big Data connectivity, announced today that Databricks has licensed Simba’s ODBC Driver as its standards-based connectivity solution for Shark, the SQL front-end for Apache Spark, the next generation Big Data processing engine. Founded by the...

Databricks Application Spotlight at Spark Summit 2014

At Databricks, we’ve been thrilled to see the rapid pace of adoption of Spark, as it has been embraced by an increasing number of enterprise vendors and has grown to be the most active open source project in the Hadoop ecosystem. We also know that a critical piece of...

Making Spark Easier to Use in Java with Java 8

One of Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s...

Databricks and MapR

Today, MapR announced that it will distribute and support the Apache Spark platform as part of the MapR Distribution for Hadoop in partnership with Databricks. We’re thrilled to start on this journey with MapR for a multitude of reasons. One of our primary goals at Databricks is to drive...

MapR Integrates the Complete Spark Stack

This post is guest authored by our friends at MapR, announcing our new partnership to provide enterprise support for Spark as part of MapR’s Distribution of Hadoop. With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in...

Spark 0.9.1 Released

We are happy to announce the availability of Spark 0.9.1! This is a maintenance release with bug fixes, performance improvements, better stability with YARN and improved parity of the Scala and Python API. We recommend that all 0.9.0 users upgrade to this stable release. This is the first release...

Application Spotlight: Alpine Data Labs

This post is guest authored by our friends at Alpine Data Labs, part of the ‘Application Spotlight’ series highlighting innovative applications that are part of the Databricks “Certified on Spark” program. Everyone knows how hard it is to recruit engineers and data scientists in Silicon Valley. At Alpine Data...

Spark SQL: Manipulating Structured Data Using Spark

Building a unified platform for big data analytics has long been the vision of Apache Spark, allowing a single program to perform ETL, MapReduce, and complex analytics. An important aspect of unification that our users have consistently requested is the ability to more easily import data stored in external...

Sharethrough Uses Spark Streaming to Optimize Bidding in Real Time

We’re very happy to see our friends at Cloudera continue to get the word out about Spark, and their latest blog post is a great example of how users are putting Spark Streaming to use to solve complex problems in real time. Thanks to Russell Cardullo and Michael Ruggiero,...

Apache Spark: A Delight for Developers

This article was cross-posted in the Cloudera developer blog. Apache Spark is well known today for its performance benefits over MapReduce, as well as its versatility. However, another important benefit — the elegance of the development experience — gets less mainstream attention. In this post, you’ll learn just a...

Databricks announces “Certified on Spark” Program

BERKELEY, Calif. – March 18, 2014 – Databricks, the company founded by the creators of Apache Spark that is revolutionizing what enterprises can do with Big Data, today announced the Databricks “Certified on Spark” Program for applications built on top of the Apache Spark platform. This program ensures that...

Spark Now a Top-level Apache Project

We are delighted with the recent announcement of the Apache Software Foundation that Spark has become a top-level Apache project. This is a recognition of the fantastic work done by the Spark open source community, which now counts over 140 developers from 30+ companies. In a short time, Spark has...

AMPLab updates the Big Data Benchmark

The AMPLab at UC Berkeley, with help from Databricks, recently released an update to the Big Data Benchmark. This benchmark uses Amazon EC2 to compare performance of five popular SQL query engines in the Big Data ecosystem on common types of queries, which can be reproduced through publicly available...

Databricks at the O’Reilly Strata Conference 2014

The Databricks team is excited to take part in a number of activities throughout the 2014 O’Reilly Strata Conference in Santa Clara. From hands-on training, to office hours, to several talks (including a keynote), there are plenty of chances for attendees to learn how Spark is bringing ease of...

Spark 0.9.0 Released

Our goal with Apache Spark is very simple: provide the best platform for computation on big data. We do this through both a powerful core engine and rich libraries for useful analytics tasks. Today, we are excited to announce the release of Spark 0.9.0. This major release extends Spark’s...

Spark and Hadoop: Working Together

We are often asked how Apache Spark fits in the Hadoop ecosystem, and how one can run Spark in an existing Hadoop cluster. This blog aims to answer these questions. First, Spark is intended to enhance, not replace, the Hadoop stack. From day one, Spark was designed to...

Spark In MapReduce (SIMR)

Hadoop integration has always been a key goal of Spark, and YARN users have long been able to run Spark on YARN. However, up to now, it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically,...

Spark 0.8.1 Released

We are happy to announce the release of Apache Spark 0.8.1. In addition to performance and stability improvements, this release adds three new features. First, Spark now supports the newest versions of YARN (2.2+). Second, the standalone cluster manager supports a high-availability mode in which it can tolerate...

Highlights From Spark Summit 2013

Earlier this month we held the first Spark Summit, a conference to bring the Spark community together. We are excited to share some statistics and highlights from the event. 450 participants from over 180 companies attended Participants came from 13 countries Spark training was sold out at 200 participants...

Putting Spark to Use – Fast In-Memory Computing for Your Big Data Applications

[A version of this post appears on the Cloudera Blog.] Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through...

Databricks and Cloudera Partner to Support Spark

Today, Cloudera announced that it will distribute and support Apache Spark. We are very excited about this announcement, and what it brings to the Spark platform and the open source community. So what does this announcement mean for Spark? First, it validates the maturity of the Spark platform. Started...

The Growing Spark Community

This year has seen unprecedented growth in both the user and contributor communities around Apache Spark. This rapid growth validates the tremendous potential of the platform, and shows the great excitement around it. While Spark started as a research project by a few grad students at UC Berkeley in...

Databricks and the Apache Spark Platform

When we announced that the original team behind Apache Spark is starting a company around the project, we got a lot of excited questions. What areas will the company focus on, and what will it mean for the open source project? Today, in our first blog post at Databricks,...