Open Source | Databricks Blog

How Data Lakehouses Solve Common Issues With Data Warehouses

February 4, 2021 by Ryan Boyd in Engineering

Read Rise of the Data Lakehouse to explore why lakehouses are the data architecture of the future with the father of the data...

Ray & MLflow: Taking Distributed Machine Learning Applications to Production

February 3, 2021 by Amog Kamsetty and Archit Kulkarni in Engineering

This is a guest blog from software engineers Amog Kamsetty and Archit Kulkarni of Anyscale and contributors to Ray.io. In this blog post...

Strategies for Modernizing Investment Data Platforms

January 29, 2021 by Ricardo Portilla in Engineering

The appetite for investment was at a historic high in 2020 for both individual and institutional investors. One study showed that "retail traders...

Burning Through Electronic Health Records in Real Time With Smolder

January 28, 2021 by Ryan DeCosmo and Frank Austin Nothaft in Engineering

Check out the solution accelerator to download the notebook referred to throughout this blog. In previous blogs, we looked at two separate workflows...

How to Manage Python Dependencies in PySpark

December 22, 2020 by Hyukjin Kwon in Engineering

Controlling the environment of an application is often challenging in a distributed computing environment: it is difficult to ensure all nodes have...

Natively Query Your Delta Lake With Scala, Java, and Python

December 22, 2020 by Shixiong Zhu, Scott Sandre and Denny Lee in Engineering

Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and...

Python Autocomplete Improvements for Databricks Notebooks

December 15, 2020 by Richard Fung, Xinrong Meng, Takuya Ueshin, Hyukjin Kwon and Austin Ford in Engineering

At Databricks, we strive to provide a world-class development experience for data scientists and engineers, and new features are constantly getting added to...

How to Train XGBoost With Spark

November 16, 2020 by Stephen Offer in Data Science and ML

XGBoost is currently one of the most popular machine learning libraries, and distributed training is becoming more frequently required to accommodate the rapidly...

Improving the Spark Exclusion Mechanism in Databricks

November 6, 2020 by Tianhan Hu, Xingbo Jiang and Xiao Li in Engineering

Ed Note: This article contains references to the term blacklist, a term that the Spark community is actively working to remove from Spark...

Faster SQL: Adaptive Query Execution in Databricks

October 21, 2020 by MaryAnn Xue and Allison Wang in Engineering

Earlier this year, Databricks wrote a blog on the whole new Adaptive Query Execution framework in Spark 3.0 and Databricks Runtime 7.0. The...