Articles by Reynold Xin - Databricks Blog

Project Tungsten: Bringing Apache Spark Closer to Bare Metal

April 28, 2015 by Reynold Xin and Josh Rosen in Engineering

In a previous blog post, we looked back and surveyed performance improvements made to Apache Spark in the past year. In this...

Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More

April 24, 2015 by Reynold Xin in Engineering

Deep Dive into Spark SQL's Catalyst Optimizer

April 13, 2015 by Michael Armbrust, Yin Huai, Cheng Liang, Reynold Xin and Matei Zaharia in Engineering

Apache Spark 2.0: Rearchitecting Spark for Mobile Platforms

March 31, 2015 by Reynold Xin in Engineering

Yesterday, to celebrate Apache Spark’s fifth birthday, we looked back at the history of the project. Today, we are happy to...

Introducing DataFrames in Apache Spark for Large Scale Data Science

February 16, 2015 by Reynold Xin, Michael Armbrust and Davies Liu in Engineering

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When...

Apache Spark Officially Sets a New Record in Large-Scale Sorting

November 5, 2014 by Reynold Xin in Engineering

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a third-party benchmark measuring how fast a system...

Apache Spark the Fastest Open Source Engine for Sorting a Petabyte

October 10, 2014 by Reynold Xin in Engineering

Update November 5, 2014: Our benchmark entry has been reviewed by the benchmark committee and Apache Spark has won the Daytona GraySort...

Scalable Collaborative Filtering with Apache Spark MLlib

July 22, 2014 by Burak Yavuz and Reynold Xin in Engineering

Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Pyt...

Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark

July 1, 2014 by Reynold Xin in Engineering

With the introduction of Spark SQL and the new Hive on Apache Spark effort (HIVE-7292), we get asked a lot about...

Spark SQL: Manipulating Structured Data Using Apache Spark

March 26, 2014 by Michael Armbrust and Reynold Xin in Engineering
