Engineering Blog | Databricks Blog

Random Forests and Boosting in MLlib

January 21, 2015 by Joseph Bradley and Manish Amde in Engineering Blog

This is a post written together with Manish Amde from Origami Logic. Apache Spark 1.2 introduces Random Forests and Gradient-Boosted Trees (GBTs) into...

Improved Fault-tolerance and Zero Data Loss in Apache Spark Streaming

January 15, 2015 by Tathagata Das in Engineering Blog

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its...

Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform

January 9, 2015 by Michael Armbrust in Engineering Blog

ML Pipelines: A New High-Level API for MLlib

January 6, 2015 by Joseph Bradley, Evan Sparks and Shivaram Venkataraman in Engineering Blog

MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in...

Announcing Apache Spark Packages

December 22, 2014 by Patrick Wendell in Solutions

Today, we are happy to announce Apache Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages.

Announcing Apache Spark 1.2

December 18, 2014 by Patrick Wendell in Engineering Blog

We at Databricks are thrilled to announce the release of Apache Spark 1.2! Apache Spark 1.2 introduces many new features along with scalability...

Pearson uses Apache Spark Streaming for next generation adaptive learning platform

December 8, 2014 by Dibyendu Bhattacharya in Company Blog

This is a guest blog post from our friends at Pearson outlining their Apache Spark use case. Introduction of Pearson: Pearson is a...

Apache Spark Officially Sets a New Record in Large-Scale Sorting

November 5, 2014 by Reynold Xin in Engineering Blog

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system...

Efficient Similarity Algorithm Now in Apache Spark, Thanks to Twitter

October 20, 2014 by Reza Zadeh in Engineering Blog

Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its open-source contribution, with permission...

Apache Spark the Fastest Open Source Engine for Sorting a Petabyte

October 10, 2014 by Reynold Xin in Engineering Blog

Update November 5, 2014: Our benchmark entry has been reviewed by the benchmark committee and Apache Spark has won the Daytona GraySort...