Open Source | Databricks Blog

Improvements to Kafka integration of Spark Streaming

March 30, 2015 by Cody Koeninger, Davies Liu and Tathagata Das in Engineering

Apache Kafka is rapidly becoming one of the most popular open source stream ingestion platforms. We see the same trend among the users...

Topic modeling with LDA: MLlib meets GraphX

March 25, 2015 by Joseph Bradley in Engineering

Topic models automatically infer the topics discussed in a collection of documents. These topics can be used to summarize and organize documents, or...

What's new for Spark SQL in Apache Spark 1.3

March 24, 2015 by Michael Armbrust in Engineering

Using MongoDB with Apache Spark

March 20, 2015 by Matt Kalan in Engineering

Update August 4th 2016: Since this original post, MongoDB has released a new Databricks-certified connector for Apache Spark. See the updated blog post...

Announcing Apache Spark 1.3!

March 13, 2015 by Patrick Wendell in Engineering

Today I’m excited to announce the general availability of Apache Spark 1.3! Apache Spark 1.3 introduces the widely anticipated DataFrame API, an evolution...

Introducing DataFrames in Apache Spark for Large Scale Data Science

February 16, 2015 by Reynold Xin, Michael Armbrust and Davies Liu in Engineering

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When...

Apache Spark: A review of 2014 and looking ahead to 2015 priorities

February 13, 2015 by Patrick Wendell and Matei Zaharia in Engineering

2014 has been a year of tremendous growth for Apache Spark. It became the most active open source project in the Big Data...

An introduction to JSON support in Spark SQL

February 2, 2015 by Yin Huai in Engineering

Note: Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame. In this blog post, we introduce Spark SQL’s JSON support, a feature we have been working on at Databricks to make it dramatically easier to query and create JSON data in Spark. With the prevalence of web and mobile applications, JSON has become the de facto interchange format for web service APIs as well as long-term storage. With existing tools, users often engineer complex pipelines to read...

Introducing Streaming k-means in Apache Spark 1.2

January 28, 2015 by Jeremy Freeman in Engineering

Many real world data are acquired sequentially over time, whether messages from social media users, time series from wearable sensors, or — in...

Random Forests and Boosting in MLlib

January 21, 2015 by Joseph Bradley and Manish Amde in Engineering

This is a post written together with Manish Amde from Origami Logic. Apache Spark 1.2 introduces Random Forests and Gradient-Boosted Trees (GBTs) into...