Accelerating developers by ditching the data center
Guest blog by R Tyler Croy, Director of Platform Engineering at Scribd. People don’t tend to get excited about the data platform. It is often regarded much like road infrastructure: nobody thinks much about how vital it is for getting from point A to point B, unless it’s terribly bad. Imagine my surprise when...
Monitor Your Databricks Workspace with Audit Logs
Cloud computing has fundamentally changed how companies operate - users are no longer subject to the restrictions of on-premises hardware deployments, such as the physical limits of resources and onerous environment upgrade processes. With the convenience and flexibility of cloud services, however, come challenges in properly monitoring how your users utilize these conveniently available resources....
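As a taste of the approach, here is a minimal PySpark sketch (not taken from the post itself) of asking one usage question of the audit logs; the bucket path and the serviceName/actionName values are assumptions for illustration:

```python
# Minimal sketch: query Databricks audit logs with PySpark.
# Assumes the logs are delivered as JSON files to a hypothetical bucket path,
# and that `spark` is the SparkSession available in a Databricks notebook.
from pyspark.sql import functions as F

audit = spark.read.json("s3://my-bucket/audit-logs/")

# Example usage question: how often does each user log in?
(audit
    .filter(F.col("serviceName") == "accounts")
    .filter(F.col("actionName") == "login")
    .groupBy("userIdentity.email")
    .count()
    .show())
```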
Announcing a New Redash Connector for Databricks
We’re happy to introduce a new, open source connector for Redash, a cloud-based SQL analytics service, making it easy to query data lakes with Databricks. Data analyst teams have traditionally faced issues with stale and partial data compromising the quality of their work, and they want to be able to connect to the most complete and...
Spark + AI Summit is now a global virtual event
Extraordinary times call for extraordinary measures. That’s why we transformed this year’s Spark + AI Summit into a fully virtual experience and opened the doors to welcome everyone, free of charge. This gives us the opportunity to turn Summit into a truly global event, bringing together tens of thousands of data scientists, engineers and analysts...
Solving the World’s Toughest Problems with the Growing Open Source Ecosystem and Databricks
We started Databricks in 2013 in a tiny little office in Berkeley with the belief that data has the potential to solve the world’s toughest problems. We entered 2020 as a global organization with over 1000 employees and a customer base spanning from two-person startups to Fortune 10s. In this blog post, let’s take a...
Scalable Near Real-Time S3 Access Logging Analytics with Apache Spark™ and Delta Lake
The original blog is from Viacheslav Inozemtsev, Senior Data Engineer at Zalando, reproduced with permission. Many organizations use AWS S3 as the main storage infrastructure for their data. Moreover, by using Apache Spark™ on Databricks, they often perform transformations of that data and save the refined results back to S3 for further analysis. When...
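Before diving in, a rough sketch of the refine-and-save pattern the post describes, with hypothetical bucket paths and a deliberately simplified regex for the S3 server access log format:

```python
# Rough sketch: parse raw S3 access logs and persist the result as a Delta table.
# The bucket paths are hypothetical, and the regexes cover only a few fields.
from pyspark.sql import functions as F

raw = spark.read.text("s3://my-log-bucket/access-logs/")

# S3 server access log lines start with: bucket_owner bucket [timestamp] ...
parsed = raw.select(
    F.regexp_extract("value", r"^\S+ (\S+) \[([^\]]+)\]", 1).alias("bucket"),
    F.regexp_extract("value", r"^\S+ (\S+) \[([^\]]+)\]", 2).alias("request_time"),
    F.regexp_extract("value", r'"(\S+) ', 1).alias("http_method"),
)

# Save the refined results back to S3 as Delta for further analysis.
parsed.write.format("delta").mode("append").save("s3://my-refined-bucket/access-logs/")
```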
Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs
We are excited to announce the release of Delta Lake 0.4.0 which introduces Python APIs for manipulating and managing data in Delta tables. The key features in this release are: Python APIs for DML and utility operations (#89) - You can now use Python APIs to update/delete/merge data in Delta Lake tables and to run...
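The table path and column names below are made up, but the calls themselves are the Python APIs this release introduces; a short sketch:

```python
# Sketch of the Delta Lake 0.4.0 Python APIs (hypothetical path and schema).
from delta.tables import DeltaTable
from pyspark.sql import functions as F

deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")

# Delete: remove rows older than a cutoff date.
deltaTable.delete(F.col("eventDate") < "2019-01-01")

# Upsert (merge): update matching rows from `updates`, insert the rest.
updates = spark.read.format("delta").load("/tmp/delta/events_updates")
(deltaTable.alias("t")
    .merge(updates.alias("u"), "t.eventId = u.eventId")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```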
Parallelizing SAIGE Across Hundreds of Cores
As population genetics datasets grow exponentially, it is becoming impractical to work with genetic data without leveraging Apache Spark™. There are many ways to use Spark to derive novel insights into the role of genetic variation in disease processes. For example, Regeneron works directly on Spark SQL DataFrames, and the open-source Hail package can be...
A Guide to Training Sessions at Spark + AI Summit, Europe
Education and the pursuit of knowledge are lifelong journeys: they are never complete; there is always something new to learn, a new professional certification to add to your credentials, a knowledge gap to fill. Training at Spark + AI Summit, Europe is not only about becoming an Apache Spark expert. Nor is it only about being...
Diving Into Delta Lake: Schema Enforcement & Evolution
Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions - new ways of seeing things we had no conception of before. These mental models are not unlike a table's schema, defining how we categorize and...
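To make the pairing concrete, here is a minimal sketch with a hypothetical table path and columns: Delta rejects an append whose schema doesn’t match (enforcement) unless you opt in to evolving the schema:

```python
# Minimal sketch of schema enforcement vs. evolution (hypothetical path/columns).
df_new = spark.createDataFrame(
    [(1, "click", "2020-05-01")],
    ["id", "event", "date"],  # `date` is a column the target table lacks
)

# Enforcement: a plain append with a mismatched schema raises an error.
# df_new.write.format("delta").mode("append").save("/tmp/delta/events")

# Evolution: opt in explicitly and the new column is added to the table schema.
(df_new.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/tmp/delta/events"))
```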