Using 3rd Party Libraries in Databricks: Apache Spark Packages and Maven Libraries
In an earlier post, we described how you can easily integrate your favorite IDE with Databricks to speed up your application development. In this post, we will show you how to import 3rd party libraries, specifically Apache Spark packages, into Databricks by providing Maven coordinates. Background on Spark Packages: Spark Packages (http://spark-packages.org) is a community...
Making Databricks Better for Developers: IDE Integration
We have been working hard at Databricks to make our product more user-friendly for developers. Recently, we have added two new features that allow developers to easily use external libraries - both their own and 3rd party packages - in Databricks. We will showcase these features in a two-part series. Here is part 1, introducing how...
Statistical and Mathematical Functions with DataFrames in Apache Spark
We introduced DataFrames in Apache Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python, DataFrames in Spark expose an API that’s similar to the single-node data tools that data scientists are already familiar with. Statistics is an important part of everyday data science. We are happy...
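A minimal sketch of the kind of statistical functions the post covers, using the DataFrame stat API from the Spark 1.4 era. It assumes an existing SparkContext `sc`; the column names and random data are made up for illustration.

```python
from pyspark.sql import SQLContext
from pyspark.sql.functions import rand

sqlContext = SQLContext(sc)

# Build a small DataFrame with two random columns to summarize.
df = (sqlContext.range(0, 100)
      .withColumn('uniform', rand(seed=10))
      .withColumn('uniform2', rand(seed=27)))

df.describe().show()                        # count, mean, stddev, min, max per column
print(df.stat.corr('uniform', 'uniform2'))  # Pearson correlation between two columns
print(df.stat.cov('uniform', 'uniform2'))   # sample covariance
df.stat.freqItems(['uniform'], 0.4).show()  # approximate frequent items
```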
Apache Spark 1.1: MLlib Performance Improvements
With an ever-growing community, Apache Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports many new features. We are excited to share some of the performance improvements observed in MLlib since the 1.0 release, and discuss two key contributing factors: torrent broadcast and tree aggregation. Torrent broadcast...
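A hedged sketch of the tree aggregation pattern the post credits for part of MLlib's speedups: partitions are combined in intermediate stages rather than all being sent straight to the driver. This uses the public treeAggregate API (exposed in PySpark in later releases), not MLlib's internal code, and assumes an existing SparkContext `sc`.

```python
# Sum 0..999 across 8 partitions with a two-level aggregation tree.
rdd = sc.parallelize(range(1000), numSlices=8)

total = rdd.treeAggregate(
    0,                          # zero value for each partition's accumulator
    lambda acc, x: acc + x,     # fold an element into a partition's accumulator
    lambda a, b: a + b,         # merge two accumulators (done tree-wise)
    depth=2)                    # depth of the aggregation tree

print(total)  # 499500
```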
Statistics Functionality in Apache Spark 1.1
One of our philosophies in Apache Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate various components of a data pipeline.
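A minimal sketch of the scalable statistical functions the post refers to, via MLlib's Statistics helpers (column summaries and correlations). It assumes an existing SparkContext `sc`; the data values are made up for illustration.

```python
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors

# An RDD of vectors, one row per observation.
mat = sc.parallelize([Vectors.dense([1.0, 10.0]),
                      Vectors.dense([2.0, 20.0]),
                      Vectors.dense([3.0, 30.0])])

summary = Statistics.colStats(mat)           # per-column summary statistics
print(summary.mean(), summary.variance())

# Correlation between two series of equal length.
x = sc.parallelize([1.0, 2.0, 3.0])
y = sc.parallelize([10.0, 22.0, 29.0])
print(Statistics.corr(x, y, method="pearson"))
```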
Scalable Collaborative Filtering with Apache Spark MLlib
Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Python (Scala/Java APIs also available).
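A minimal sketch matching the "few lines of Python" claim, training a collaborative filtering model with MLlib's ALS. It assumes an existing SparkContext `sc` and a hypothetical "ratings.csv" file of "user,item,rating" lines.

```python
from pyspark.mllib.recommendation import ALS, Rating

# Parse "user,item,rating" lines into Rating objects. The file path is hypothetical.
ratings = (sc.textFile("ratings.csv")
           .map(lambda line: line.split(','))
           .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2]))))

# Factorize the user-item matrix: rank-10 factors, 10 ALS iterations.
model = ALS.train(ratings, rank=10, iterations=10)

# Predict the rating user 1 would give item 2.
print(model.predict(1, 2))
```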