Engineering | Databricks Blog

Databricks Delta: A Unified Data Management System for Real-time Big Data

October 24, 2017 by Michael Armbrust, Bill Chambers and Matei Zaharia in Platform

Combining the best of data warehouses, data lakes and streaming. For an in-depth look and demo, join the webinar. Today we are...

Introducing the Natural Language Processing Library for Apache Spark

October 19, 2017 by David Talby in Solutions

This is a community blog and effort from the engineering team at John Snow Labs, explaining their contribution to an open-source Apache Spark...

Using Databricks to Democratize Big Data and Machine Learning at McGraw-Hill Education

October 18, 2017 by Matthew Hogan in Engineering

This is a guest post from Matt Hogan, Sr. Director of Engineering, Analytics and Reporting at McGraw-Hill Education. McGraw-Hill Education is a 129-year-old...

Arbitrary Stateful Processing in Apache Spark’s Structured Streaming

October 17, 2017 by Bill Chambers and Jules Damji in Engineering

This is the seventh post in a multi-part series about how you can perform complex streaming analytics using Apache Spark and Structured Streaming...

Benchmarking Structured Streaming on Databricks Runtime Against State-of-the-Art Streaming Systems

October 11, 2017 by Burak Yavuz in Engineering

Update Dec 14, 2017: As a result of a fix in the toolkit’s data generator, Apache Flink's performance on a cluster of...

Accelerating R Workflows on Databricks

October 6, 2017 by Hossein Falaki in Engineering

At Databricks we strive to make our Unified Analytics Platform the best place to run big data analytics. For big data, Apache Spark...

Building Complex Data Pipelines with Unified Analytics Platform

October 5, 2017 by Jules Damji and Jason Pohl in Platform

Introduction: Big data practitioners often post recurring questions on Quora: What is data engineering? How to become a data scientist? What’s a data...

Do your Streaming ETL at Scale with Apache Spark’s Structured Streaming

September 1, 2017 by Tathagata Das in Announcements

At the Spark Summit in San Francisco in June, we announced that Apache Spark’s Structured Streaming is marked as production-ready and shared...

Cost Based Optimizer in Apache Spark 2.2

August 31, 2017 by Ron Hu, Zhenhua Wang, Wenchen Fan and Sameer Agarwal in Engineering

This is a joint engineering effort between Databricks’ Apache Spark engineering team (Sameer Agarwal and Wenchen Fan) and Huawei’s engineering team (Ron Hu...

Developing Custom Machine Learning Algorithms in PySpark

August 30, 2017 by Ajay Saini and Joseph Bradley in Engineering

Developing custom Machine Learning (ML) algorithms in PySpark—the Python API for Apache Spark—can be challenging and laborious. In this blog post, we describe...