Processing Geospatial Data at Scale With Databricks
The evolution and convergence of technology have fueled a vibrant marketplace for timely and accurate geospatial data. Every day, billions of handheld and IoT devices, along with thousands of airborne and satellite remote sensing platforms, generate hundreds of exabytes of location-aware data. This boom of geospatial big data combined with advancements in machine learning is...
Streamlining Variant Normalization on Large Genomic Datasets with Glow
Cross-posted from the Glow blog. Many research and drug development projects in the genomics world involve large genomic variant data sets, the volume of which has been growing exponentially over the past decade. However, the tools to extract, transform, load (ETL) and analyze these data sets have not kept pace with this growth. Single-node...
New Databricks Integration for Jupyter Bridges Local and Remote Workflows
Introduction For many years now, data scientists have developed specific workflows on premises using local filesystem hierarchies, source code revision systems and CI/CD processes. On the other side, the available data is growing exponentially and new capabilities for data analysis and modeling are needed, for example, easily scalable storage, distributed computing systems or special hardware...
Migration from Hadoop to modern cloud platforms: The case for Hadoop alternatives
Companies rely on their big data and analytics platforms to support innovation and digital transformation strategies. However, many Hadoop users struggle with complexity, unscalable infrastructure, excessive maintenance overhead and, overall, unrealized value. We help customers navigate their Hadoop migrations to modern cloud platforms such as Databricks and our partner products and solutions, and in this...
Deep Learning Tutorial Demonstrates How to Simplify Distributed Deep Learning Model Inference Using Delta Lake and Apache Spark™
On October 10th, our team hosted a live webinar—Simple Distributed Deep Learning Model Inference—with Xiangrui Meng, Software Engineer at Databricks. Model inference, unlike model training, is usually embarrassingly parallel and hence simple to distribute. However, in practice, complex data scenarios and compute infrastructure often make this "simple" task hard to do from data source to...
Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions
Introduction In the post Using AutoML Toolkit to Automate Loan Default Predictions, we showed how the Databricks Labs’ AutoML Toolkit simplified Machine Learning model feature engineering and model building optimization (MBO). It also improved the area under the curve (AUC) from 0.6732 (handmade XGBoost model) to 0.723 (AutoML XGBoost model). With AutoML Toolkit’s Release 0.6.1, we...
Scalable near real-time S3 access logging analytics with Apache Spark™ and Delta Lake
The original blog is from Viacheslav Inozemtsev, Senior Data Engineer at Zalando, reproduced with permission. Introduction Many organizations use AWS S3 as their main storage infrastructure for their data. Moreover, by using Apache Spark™ on Databricks they often perform transformations of that data and save the refined results back to S3 for further analysis. When...
Scaling Hyperopt to Tune Machine Learning Models in Python
Try the Hyperopt notebook to reproduce the steps outlined below and watch our on-demand webinar to learn more. Hyperopt is one of the most popular open-source libraries for tuning Machine Learning models in Python. We’re excited to announce that Hyperopt 0.2.1 supports distributed tuning via Apache Spark. The new SparkTrials class allows you to scale out...
Scaling Financial Time Series Analysis Beyond PCs and Pandas: On-Demand Webinar, Slides and FAQ Now Available!
On Oct 9th, 2019, we hosted a live webinar, "Scaling Financial Time Series Analysis Beyond PCs and Pandas," with Junta Nakai, Industry Leader Financial Services at Databricks, and Ricardo Portilla, Solution Architect at Databricks. The webinar showcased the content of the blog post Democratizing Financial Time Series Analysis with Databricks. Please find...
Managed MLflow Now Available on Databricks Community Edition
In February 2016, we introduced Databricks Community Edition, a free edition for big data developers to learn and get started quickly with Apache Spark. Since then, our commitment to fostering a community of developers has remained steadfast: to date, we have over 150K registered Community Edition users; we have trained thousands of people at meetups and...