The Architecture of the Next CERN Accelerator Logging Service

December 14, 2017 by Jakub Wozniak
This is a community guest blog from Jakub Wozniak, a software engineer and project technical lead at CERN physics laboratory, further expounding...

Introducing Pandas UDF for PySpark

October 30, 2017 by Li Jin
NOTE: Spark 3.0 introduced a new pandas UDF. You can find more details in the following blog post: New Pandas UDFs and Python...

Introducing the Natural Language Processing Library for Apache Spark

October 19, 2017 by David Talby
This is a community blog and effort from the engineering team at John Snow Labs, explaining their contribution to an open-source Apache Spark...

Arbitrary Stateful Processing in Apache Spark’s Structured Streaming

October 17, 2017 by Bill Chambers and Jules Damji
This is the seventh post in a multi-part series about how you can perform complex streaming analytics using Apache Spark and Structured Streaming...

Benchmarking Structured Streaming on Databricks Runtime Against State-of-the-Art Streaming Systems

October 11, 2017 by Burak Yavuz
Update Dec 14, 2017: As a result of a fix in the toolkit’s data generator, Apache Flink's performance on a cluster of...

Accelerating R Workflows on Databricks

October 6, 2017 by Hossein Falaki
At Databricks we strive to make our Unified Analytics Platform the best place to run big data analytics. For big data, Apache Spark...

Building Complex Data Pipelines with Unified Analytics Platform

October 5, 2017 by Jules Damji and Jason Pohl
Introduction: Big data practitioners often post recurring questions on Quora: What is data engineering? How to become a data scientist? What’s a data...

Cost Based Optimizer in Apache Spark 2.2

This is a joint engineering effort between Databricks’ Apache Spark engineering team (Sameer Agarwal and Wenchen Fan) and Huawei’s engineering team (Ron Hu...

Developing Custom Machine Learning Algorithms in PySpark

August 30, 2017 by Ajay Saini and Joseph Bradley
Developing custom Machine Learning (ML) algorithms in PySpark—the Python API for Apache Spark—can be challenging and laborious. In this blog post, we describe...

Anthology of Technical Assets on Apache Spark's Structured Streaming

August 24, 2017 by Jules Damji
Older anthologies collated a collection of contributions from various authors around a theme—bounded then as a journal or periodical. Newer anthologies, however, include...