New Features in Machine Learning Pipelines in Spark 1.4

Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark 1.4, significantly extends the ML library. In this post, we highlight several new features in the ML Pipelines API, including: a stable API (Pipelines have graduated from Alpha!), new feature transformers, additional…

Read

Using 3rd Party Libraries in Databricks: Spark Packages and Maven Libraries

In an earlier post, we described how you can easily integrate your favorite IDE with Databricks to speed up your application development. In this post, we will show you how to import 3rd party libraries, specifically Spark Packages, into Databricks by providing Maven coordinates. Background on Spark Packages: Spark Packages (http://spark-packages.org) is a community package…

Read

Yesware Deploys Production Data Pipeline in Record Time with Databricks

We are happy to announce that Yesware chose Databricks to build its production data pipeline, completing the project in record time, in just under three weeks. (Press release: http://www.marketwired.com/press-release/yesware-deploys-production-data-pipeline-in-record-time-with-databricks-2041188.htm) Yesware, the leading sales acceleration software for sales teams at major enterprise companies such as eBay, New Relic, and IBM, enables sales professionals to have highly effective and…

Read

Be Heard with the Spark Survey

At Databricks, we are constantly working to improve Apache Spark. We would love to hear from you to help us and the Spark community set Spark’s future direction. A recent example of the community helping to direct Spark is SparkR. As noted in the Datanami article "Python Versus R in Apache Spark", we were bombarded with…

Read

Joint Blog Post: Bringing ORC Support into Apache Spark

This is a joint blog post with our partner Hortonworks. Zhan Zhang is a member of technical staff at Hortonworks, where he collaborated with the Databricks team on this new feature. In version 1.2.0, Apache Spark introduced a Data Source API (SPARK-3247) to enable deep platform integration with a larger number of data sources and sinks.…

Read

Introducing Window Functions in Spark SQL

In this blog post, we introduce the new window function feature that was added in Spark 1.4. Window functions allow users of Spark SQL to calculate results such as the rank of a given row or a moving average over a range of input rows. They significantly improve the expressiveness of Spark’s SQL and DataFrame…

Read

Introducing R Notebooks in Databricks

Spark 1.4 was released on June 11 and one of the exciting new features was SparkR. I am happy to announce that we now support R notebooks and SparkR in Databricks, our hosted Spark service. Databricks lets you easily use SparkR in an interactive notebook environment or standalone jobs. R and Spark nicely complement each…

Read

Announcing SparkHub: A Community Site for Apache Spark

Today, we are happy to announce SparkHub (http://sparkhub.databricks.com), a service for the Apache Spark™ community to easily find the most relevant Spark resources on the web. SparkHub contains the latest news about Spark, the newest videos of Spark talks, the most recent Spark packages, and upcoming Spark events around the world. Want to find the next Spark…

Read

New Visualizations for Understanding Spark Streaming Applications

Earlier, we presented new visualizations introduced in Spark 1.4.0 to understand the behavior of Spark applications. Continuing the theme, this blog highlights new visualizations introduced specifically for understanding Spark Streaming applications. We have updated the Streaming tab of the Spark UI to show the following: timelines and statistics of event rates, scheduling delays and processing…

Read

Guest blog: PMML Support in Spark MLlib

This is a guest blog from our friend Vincenzo Selvaggio who contributed this feature. He is a Senior Java Technical Architect and Project Manager, focusing on delivering advanced business process solutions for investment banks. The recently released Apache Spark 1.4 introduces PMML support to MLlib for linear models and k-means clustering. This achievement is the…

Read