Open Source | Databricks Blog

Improved Fault-tolerance and Zero Data Loss in Apache Spark Streaming

January 15, 2015 by Tathagata Das in Engineering

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its...

Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform

January 9, 2015 by Michael Armbrust in Engineering

Announcing Apache Spark Packages

December 22, 2014 by Patrick Wendell in Solutions

Today, we are happy to announce Apache Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages.

Announcing Apache Spark 1.2

December 18, 2014 by Patrick Wendell in Engineering

We at Databricks are thrilled to announce the release of Apache Spark 1.2! Apache Spark 1.2 introduces many new features along with scalability...

Apache Spark Officially Sets a New Record in Large-Scale Sorting

November 5, 2014 by Reynold Xin in Engineering

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system...

Efficient Similarity Algorithm Now in Apache Spark, Thanks to Twitter

October 20, 2014 by Reza Zadeh in Engineering

Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its open-source contribution, with permission...

Apache Spark the Fastest Open Source Engine for Sorting a Petabyte

October 10, 2014 by Reynold Xin in Engineering

Update November 5, 2014: Our benchmark entry has been reviewed by the benchmark committee and Apache Spark has won the Daytona GraySort...

Apache Spark as a platform for large-scale neuroscience

October 1, 2014 by Jeremy Freeman in Engineering

The brain is the most complicated organ of the body, and probably one of the most complicated structures in the universe. It’s millions...

Apache Spark 1.1: MLlib Performance Improvements

September 22, 2014 by Burak Yavuz in Engineering

With an ever-growing community, Apache Spark has had its 1.1 release. MLlib has had its fair share of contributions and now supports...

Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark

September 17, 2014 by Nick Pentreath and Kan Zhang in Engineering

This is a guest post by Nick Pentreath of Graphflow and Kan Zhang of IBM, who contributed Python input/output format support to Apache Spark 1.1. Two powerful features of Apache Spark include its native APIs provided in Scala, Java and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient in the use of Spark even without experience in Scala, and furthermore can leverag