
Sharethrough Uses Apache Spark Streaming to Optimize Advertisers' Return on Marketing Investment

October 7, 2014 by Russell Cardullo
This is a guest blog post from our friends at Sharethrough providing an update on how their use of Apache Spark has continued...

Apache Spark as a platform for large-scale neuroscience

October 1, 2014 by Jeremy Freeman
The brain is the most complicated organ of the body, and probably one of the most complicated structures in the universe. It’s millions...

Scalable Decision Trees in MLlib

September 29, 2014 by Manish Amde and Joseph Bradley
This is a post written together with one of our friends at Origami Logic. Origami Logic provides a Marketing Intelligence Platform that uses...

Apache Spark 1.1: MLlib Performance Improvements

September 22, 2014 by Burak Yavuz
With an ever-growing community, Apache Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports...

Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark

September 17, 2014 by Nick Pentreath and Kan Zhang
This is a guest post by Nick Pentreath of Graphflow and Kan Zhang of IBM, who contributed Python input/output format support to Apache Spark 1.1. Two powerful features of Apache Spark include its native APIs provided in Scala, Java and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient in the use of Spark even without experience in Scala, and furthermore can leverage...

Apache Spark 1.1: The State of Spark Streaming

September 16, 2014 by Tathagata Das and Patrick Wendell
With Apache Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components - Spark...

Announcing Apache Spark 1.1

September 11, 2014 by Patrick Wendell
Today we’re thrilled to announce the release of Apache Spark 1.1! Apache Spark 1.1 introduces many new features along with scale and stability improvements. This post will introduce some key features of Apache Spark 1.1 and provide context on the priorities of Spark for this and the next release.

Statistics Functionality in Apache Spark 1.1

One of our philosophies in Apache Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate...
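To give a flavor of the kind of statistical support described here, below is a minimal plain-Python sketch of column-wise summary statistics (mean, variance, min, max), the sort of per-feature summaries MLlib computes at scale. The function name `col_stats` and the sample data are illustrative, not from the Spark API.

```python
# A small sketch of column-wise summary statistics, illustrating
# the kind of per-feature summaries MLlib provides at scale.
from statistics import mean, variance

def col_stats(rows):
    """Compute per-column mean, sample variance, min, and max
    for a list of equal-length numeric rows."""
    cols = list(zip(*rows))  # transpose rows into columns
    return [
        {"mean": mean(c), "variance": variance(c), "min": min(c), "max": max(c)}
        for c in cols
    ]

# Three observations of two features.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
stats = col_stats(data)
```

In Spark, the same idea runs over a distributed RDD of vectors rather than an in-memory list, so the per-partition summaries are merged in a single pass over the data.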

Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao

August 14, 2014 by Andy Huang and Wei Wu
This is a guest blog post from our friends at Alibaba Taobao. Alibaba Taobao operates one of the world’s largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Apache Spark to analyze these enormous amounts of data. Alibaba Taobao probably runs some of the largest Spark jobs in the world. For example, some Spark jobs run for weeks to perform feature extraction on petabytes of image data. In this blog post, we share our...

Scalable Collaborative Filtering with Apache Spark MLlib

July 22, 2014 by Burak Yavuz and Reynold Xin
Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Python...
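The technique behind MLlib's recommender is matrix factorization: learn low-rank user and item factor vectors whose dot product approximates the observed ratings. As a rough illustration of that idea (not MLlib's distributed ALS solver), here is a tiny plain-Python sketch that fits the factors by stochastic gradient descent on a toy rating triples list; all names and data are made up for the example.

```python
# Minimal matrix-factorization sketch: learn U (user factors) and
# V (item factors) so that dot(U[u], V[i]) approximates rating r.
import random

def factorize(ratings, n_users, n_items, rank=2, steps=2000, lr=0.01, reg=0.05):
    """Fit user/item factors by SGD on (user, item, rating) triples."""
    random.seed(0)
    U = [[random.random() for _ in range(rank)] for _ in range(n_users)]
    V = [[random.random() for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(U[u][k] * V[i][k] for k in range(rank))
            err = r - pred
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step with L2 regularization on both factors.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

# Toy data: (user, item, rating) triples on a 3x3 user/item grid.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 5.0), (2, 2, 4.0)]
U, V = factorize(ratings, n_users=3, n_items=3)
# Reconstructed score for an observed pair should be near its rating.
pred = sum(U[0][k] * V[0][k] for k in range(len(U[0])))
```

MLlib's ALS solves the same objective but alternates closed-form least-squares updates over the user and item factors, which parallelizes cleanly across a cluster; that is what makes billions of records tractable.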