Leveling the Playing Field: HorovodRunner for Distributed Deep Learning Training
This is a guest post authored by Sr. Staff Data Scientist/User Experience Researcher Jing Pan and Senior Data Scientist Wendao Liu of leading health insurance marketplace eHealth. None generates Taichi; Taichi generates two complementary forces; Two complementary forces generate four aggregates; Four aggregates generate eight trigrams; Eight trigrams determine myriads of phenomena. —Classic of Changes...
A Step-by-step Guide for Debugging Memory Leaks in Spark Applications
This is a guest authored post by Shivansh Srivastava, software engineer, Disney Streaming Services. It was originally published on Medium.com. Just a bit of context: We at Disney Streaming Services use Apache Spark across the business and Spark Structured Streaming to develop our pipelines. These applications run on the Databricks Runtime (DBR) environment which is quite...
Handling Late Arriving Dimensions Using a Reconciliation Pattern
This is a guest community post authored by Chaitanya Chandurkar, Senior Software Engineer in the Analytics and Reporting team at McGraw Hill Education. Special thanks to MHE Analytics team members Nick Afshartous, Principal Engineer; Kapil Shrivastava, Engineering Manager; and Steve Stalzer, VP of Engineering / Analytics and Data Science, for their contributions. Processing facts and...
How Retina Uses Databricks Container Services to Improve Efficiency and Reduce Costs
This is a guest community post authored by Brad Ito, CTO of Retina.ai, with contributions by Databricks Customer Success Engineer Vini Jaiswal. Retina is the customer intelligence partner that empowers businesses to maximize customer-level profitability. We help our clients boost revenue with the most accurate lifetime value metrics. Our forward-looking, proprietary models predict customer lifetime value...
Enforcing Column-level Encryption and Avoiding Data Duplication With PII
This is a guest post by Keyuri Shah, lead software engineer, and Fred Kimball, software engineer, Northwestern Mutual. Protecting PII (personally identifiable information) is very important as the number of data breaches and records with sensitive information exposed every day is trending upward. To avoid becoming the next victim and protect users from identity...
How Scribd Uses Delta Lake to Enable the World’s Largest Digital Library
Scribd uses Delta Lake to enable the world’s largest digital library. Watch this discussion with QP Hou, Senior Engineer at Scribd and an Airflow committer, and R Tyler Croy, Director of Platform Engineering at Scribd to learn how they transitioned from legacy on-premises infrastructure to AWS and how they utilized, implemented, and optimized Delta tables...
Key Sessions for Microsoft Azure Customers at Data + AI Summit Europe 2020
Databricks, diamond sponsor Microsoft and Azure Databricks customers to present keynotes and breakout sessions at Data + AI Summit Europe. Data + AI Summit Europe is the free virtual event for data teams — data scientists, engineers and analysts — who will tune in from all over the world to share best practices, discover new...
How to Evaluate Data Pipelines for Cost to Performance
Learn best practices for designing and evaluating cost-to-performance benchmarks from Germany’s #1 weather portal. While we certainly conduct several benchmarks, we know the best benchmark is your queries running on your data. But what are you benchmarking against in your evaluation? The answer seems obvious - cost and integration with your cloud architecture roadmap. We...
Analytics on the Data Lake With Tableau and the Lakehouse Architecture
Over the past two years we’ve seen a number of organizations moving their data work to the cloud. It simplifies access and scales to handle the biggest volumes. At Tableau, we’re all about customer choice and flexibility, and we’ve enabled our customers to move to the cloud faster than ever. Analytics and data science/machine learning...
Media and Entertainment Agenda for Data + AI Summit Europe 2020
Looking for the best Media and Entertainment (M&E) events and sessions at Data + AI Summit Europe 2020 (Nov 17-19)? Below are some highlights. You can also find all M&E-related sessions, including customer case studies and extensive how-tos, within the event homepage by selecting “Media and Entertainment” from the “Industry” dropdown menu. You can...