Improvements to Kafka integration of Spark Streaming

Apache Kafka is rapidly becoming one of the most popular open source stream ingestion platforms. We see the same trend among the users...

Topic modeling with LDA: MLlib meets GraphX

March 25, 2015 by Joseph Bradley
Topic models automatically infer the topics discussed in a collection of documents. These topics can be used to summarize and organize documents, or...

What's new for Spark SQL in Apache Spark 1.3

March 24, 2015 by Michael Armbrust

Using MongoDB with Apache Spark

March 20, 2015 by Matt Kalan
Update August 4th 2016: Since this original post, MongoDB has released a new Databricks-certified connector for Apache Spark. See the updated blog post...

Announcing Apache Spark 1.3!

March 13, 2015 by Patrick Wendell
Today I’m excited to announce the general availability of Apache Spark 1.3! Apache Spark 1.3 introduces the widely anticipated DataFrame API, an evolution...

Introducing DataFrames in Apache Spark for Large Scale Data Science

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When...

Apache Spark: A review of 2014 and looking ahead to 2015 priorities

February 13, 2015 by Patrick Wendell and Matei Zaharia
2014 has been a year of tremendous growth for Apache Spark. It became the most active open source project in the Big Data...

An introduction to JSON support in Spark SQL

February 2, 2015 by Yin Huai
Note: Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame. In this blog post, we introduce Spark SQL’s JSON support, a feature we have been working on at Databricks to make it dramatically easier to query and create JSON data in Spark. With the prevalence of web and mobile applications, JSON has become the de facto interchange format for web service APIs as well as for long-term storage. With existing tools, users often engineer complex pipelines to read...

Introducing Streaming k-means in Apache Spark 1.2

January 28, 2015 by Jeremy Freeman
Many real-world data are acquired sequentially over time, whether messages from social media users, time series from wearable sensors, or, in...

Random Forests and Boosting in MLlib

January 21, 2015 by Joseph Bradley and Manish Amde
This is a post written together with Manish Amde from Origami Logic. Apache Spark 1.2 introduces Random Forests and Gradient-Boosted Trees (GBTs) into...