Leveling the Playing Field: HorovodRunner for Distributed Deep Learning Training
This is a guest post authored by Sr. Staff Data Scientist/User Experience Researcher Jing Pan and Senior Data Scientist Wendao Liu of leading health insurance marketplace eHealth. None generates Taichi; Taichi generates two complementary forces; two complementary forces generate four aggregates; four aggregates generate eight trigrams; eight trigrams determine myriads of phenomena. —Classic of Changes...
Data Access Governance and 3 Signs You Need it
This is a guest authored post by Heather Devane, content marketing manager, Immuta. Cloud data analytics is only as powerful as the ability to access that data for use. Yet, the data stewards responsible for managing data governance often find themselves in a holding pattern, waiting for approval from various stakeholders to operationalize data assets...
Over 200K Enrolled in Databricks’ Certification and Training
More than 200,000 individuals have participated in Databricks' certification and training over the past four years, including thousands of partners. In the past year alone, over 75,000 individuals have been trained, and more than 1,500 customers and partners have earned Databricks Academy Certifications. Today, we are pleased to announce new digital badges so you...
Lakehouse Architecture Realized: Enabling Data Teams With Faster, Cheaper and More Reliable Open Architectures
Databricks was founded under the vision of using data to solve the world’s toughest problems. We started by building upon our open source roots in Apache Spark™ and creating a thriving collection of projects, including Delta Lake, MLflow, Koalas and more. We’ve now built a company with over 1,500 employees helping thousands of data teams...
Bayesian Modeling of the Temporal Dynamics of COVID-19 Using PyMC3
In this post, we look at how to use PyMC3 to infer the disease parameters for COVID-19. PyMC3 is a popular probabilistic programming framework used for Bayesian modeling. Two popular methods for performing this inference are Markov Chain Monte Carlo (MCMC) and Variational Inference. The work here looks at using the currently...
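For illustration, here is a minimal PyMC3 sketch of this style of workflow. The exponential-growth model, priors, and toy case counts below are assumptions made for the example, not the model from the post.

```python
# Minimal PyMC3 sketch (illustrative only): infer a growth rate from toy daily case counts.
import numpy as np
import pymc3 as pm

cases = np.array([2, 5, 9, 16, 28, 50, 88, 155])  # hypothetical daily counts
t = np.arange(len(cases))

with pm.Model():
    # Priors for the initial count and the exponential growth rate
    i0 = pm.Lognormal("i0", mu=np.log(2), sigma=1.0)
    growth = pm.Normal("growth", mu=0.3, sigma=0.2)
    expected = i0 * pm.math.exp(growth * t)
    # Poisson likelihood over the observed daily counts
    pm.Poisson("obs", mu=expected, observed=cases)
    # MCMC sampling with NUTS; pm.fit() would give a variational approximation instead
    trace = pm.sample(2000, tune=1000, target_accept=0.9)

print(pm.summary(trace))
```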
How to Manage Python Dependencies in PySpark
Controlling the environment of an application is often challenging in a distributed computing setting: it is difficult to ensure that all nodes have the desired environment in which to execute, it can be tricky to know where the user's code is actually running, and so on. Apache Spark™ provides several standard ways to manage dependencies across the...
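As one concrete illustration, the sketch below packs a Conda environment with conda-pack and ships it to the executors via Spark's archive mechanism (Spark 3.1+). The environment name and archive file are hypothetical, and this is a sketch of the general approach rather than the post's exact recipe.

```python
# Built beforehand on the client machine (shell):
#   conda create -y -n pyspark_conda_env python=3.8 numpy pandas
#   conda activate pyspark_conda_env && pip install conda-pack
#   conda pack -f -o pyspark_conda_env.tar.gz
import os
from pyspark.sql import SparkSession

# Point the executors at the Python interpreter unpacked from the archive.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # Spark 3.1+: the '#environment' suffix names the directory the archive is unpacked into.
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)

# Python code now runs against the packed environment on every executor.
print(spark.range(10).count())
```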
Natively Query Your Delta Lake With Scala, Java, and Python
Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and Python (via the Delta Rust API). Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch...
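As a small illustration of the Python route, the sketch below uses the delta-rs Python bindings (the deltalake package) to read a Delta table without a Spark cluster; the table path is hypothetical.

```python
from deltalake import DeltaTable

# Open a Delta table directly from its storage path (local, s3://, abfss://, ...).
dt = DeltaTable("./data/events_delta")

print(dt.version())   # current table version
print(dt.files())     # Parquet data files backing that version

# Materialize as Arrow, then pandas, for local analysis.
df = dt.to_pyarrow_table().to_pandas()
print(df.head())
```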
Personalizing the Customer Experience with Recommendations
Go directly to the Recommendation notebooks referenced throughout this post. Retail made a giant leap forward in the adoption of e-commerce in 2020: e-commerce as a percentage of total retail saw multiple years of progress in a single year. Meanwhile, COVID, lockdowns and economic uncertainty have completely disrupted how we engage and retain customers. Companies need...
A Step-by-step Guide for Debugging Memory Leaks in Spark Applications
This is a guest authored post by Shivansh Srivastava, software engineer, Disney Streaming Services. It was originally published on Medium.com. Just a bit of context: we at Disney Streaming Services use Apache Spark across the business and Spark Structured Streaming to develop our pipelines. These applications run on the Databricks Runtime (DBR) environment, which is quite...
Top Questions from Our Lakehouse Event
We recently held a virtual event, featuring CEO Ali Ghodsi, that showcased the vision of the Lakehouse architecture and how Databricks helps customers make it a reality. Lakehouse is a data platform architecture that implements data structures and data management features similar to those of a data warehouse, directly on the low-cost, flexible storage used for...