How Scribd Uses Delta Lake to Enable the World’s Largest Digital Library
Scribd uses Delta Lake to enable the world’s largest digital library. Watch this discussion with QP Hou, Senior Engineer at Scribd and an Airflow committer, and R Tyler Croy, Director of Platform Engineering at Scribd, to learn how they transitioned from legacy on-premises infrastructure to AWS and how they implemented and optimized Delta tables...
COVID-19 Datasets Now Available on Databricks: How the Data Community Can Help
Initially published April 14th, 2020; updated April 21st, 2020. With the massive disruption of the current COVID-19 pandemic, many data engineers and data scientists are asking themselves, “How can the data community help?” The data community is already doing some amazing work in a short amount of time including (but certainly not limited to) one...
Query Delta Lake Tables from Presto and Athena, Improved Operations Concurrency, and Merge Performance
We are excited to announce the release of Delta Lake 0.5.0, which introduces Presto/Athena support and improved concurrency. The key features in this release are: Support for other processing engines using manifest files (#76) - You can now query Delta tables from Presto and Amazon Athena using manifest files, which you can generate using Scala,...
Detecting Financial Fraud at Scale with Decision Trees and MLflow on Databricks
Detecting fraudulent patterns at scale using artificial intelligence is a challenge, no matter the use case. The massive amounts of historical data to sift through, the complexity of the constantly evolving machine learning and deep learning techniques, and the very small number of actual examples of fraudulent behavior are comparable to finding a needle in...
MLflow On-Demand Webinar and FAQ Now Available!
On August 30th, our team hosted a live webinar—Introducing MLflow: Infrastructure for a complete Machine Learning lifecycle—with Matei Zaharia, Co-Founder and Chief Technologist at Databricks. In this webinar, we walked you through MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library...
Another Record-Setting Spark Summit
The lure of San Francisco is indisputable, as is its position as the preeminent high-tech hub. Day one of Spark Summit 2016, the largest community event dedicated to Apache Spark, drew 2,500+ Spark enthusiasts from 720+ companies. Such a draw is a strong testament to Apache Spark’s open source roots, its fast-growing...
An Illustrated Guide to Advertising Analytics
To learn the latest developments in Apache Spark, register today to join the Spark community at Spark Summit in New York City! This is a joint blog with our friend at Celtra. Grega Kešpret is the Director of Engineering. He leads a team of engineers and data scientists to build analytics pipeline and optimization systems...
Spark Summit East 2016 Agenda is now available
This February, join the Apache Spark community in New York City at the New York Midtown Hilton for the second annual Spark Summit East on February 16th-18th! We are happy to announce that the community talks agenda has been finalized and you can find the full list of 60 community talks available at the Spark...
Databricks 2015 Year In Review: Democratizing Access to Data
To learn more about Apache Spark, attend Spark Summit East in New York in Feb 2016. 2015 has been a phenomenal year of growth for both Databricks and the Apache Spark project. In June, we launched general availability (GA) of our cloud platform, the first end-to-end enterprise data platform based on Spark. At the same time,...
Spark Survey 2015 Results are now available
We ran the Spark Survey 2015 this summer to gain insights into how organizations are using Apache Spark. The results of this year’s Spark Survey - reflecting the answers and opinions of over 1,417 respondents representing 842 organizations - strongly indicate the rapid growth of the Spark community and offer valuable insight into the direction...