Fokko Driesprong

Principal Code Connoisseur, GoDataDriven

Fokko Driesprong, Principal Code Connoisseur at GoDataDriven, is a data processing enthusiast who loves functional programming (preferably in Scala). As a data engineering consultant, he helps companies develop data-driven products. Alongside his consulting work, he contributes to a variety of open-source projects: among others, he is a committer on the Apache {Airflow, Avro, Parquet, Druid} projects and contributes to Apache {Spark, Flink, Superset, …}.

Past sessions

Summit Europe 2020 3D: DBT using Databricks and Delta

November 17, 2020 04:00 PM PT

Data Build Tool (DBT) is an open-source technology for setting up your data lake using best practices from software engineering. This SQL-first technology is a great marriage between Databricks and Delta, and it allows you to maintain high-quality data and documentation during the entire data lake life-cycle. In this talk I'll give an introduction to DBT and show how we can leverage Databricks to do the actual heavy lifting. Next, I'll present how DBT supports Delta to enable upserts using SQL. Then we'll show how we integrate DBT and Databricks into the Azure cloud. Finally, we'll show how we emit pipeline metrics to Azure Monitor to make sure that you have observability over your pipeline.

Speaker: Fokko Driesprong
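
The SQL-based upsert pattern the abstract refers to can be sketched as a dbt incremental model on Delta. This is an illustrative config fragment, not code from the talk; the model and column names (`transaction_id`, `updated_at`, the `raw.transactions` source) are hypothetical, while `file_format = 'delta'` and `incremental_strategy = 'merge'` are options provided by the dbt Spark adapter:

```sql
-- models/stg_transactions.sql (hypothetical model)
{{
  config(
    materialized = 'incremental',
    file_format = 'delta',
    incremental_strategy = 'merge',
    unique_key = 'transaction_id'
  )
}}

select
    transaction_id,
    account_id,
    amount,
    updated_at
from {{ source('raw', 'transactions') }}

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what the
  -- Delta table already contains; dbt turns this into a MERGE.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

With `unique_key` set, dbt compiles the model into a Delta `MERGE INTO`, so matching rows are updated in place and new rows are inserted.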

Summit Europe 2017 Working with Skewed Data: The Iterative Broadcast

October 24, 2017 05:00 PM PT

Skewed data is the enemy when joining tables using Spark. It shuffles a large proportion of the data onto a few overloaded nodes, bottlenecking Spark's parallelism and causing out-of-memory errors. The go-to answer is to use broadcast joins: leave the large, skewed dataset in place and transmit a smaller table to every machine in the cluster for joining. But what happens when your second table is too large to broadcast and does not fit into memory? Or, even worse, when a single key is bigger than the total size of your executor? First, we will give an introduction to the problem. Second, we will explain the current ways of fighting it, and why these solutions are limited. Finally, we will demonstrate a new technique - the iterative broadcast join - developed while processing ING Bank's global transaction data. This technique, implemented on top of the Spark SQL API, allows multiple large and highly skewed datasets to be joined successfully, while retaining a high level of parallelism. This is something that is not possible with existing Spark join types.
Session hashtag: #EUde11
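
The idea behind the iterative broadcast join can be illustrated in plain Python, outside Spark: split the too-big table into chunks, "broadcast" one chunk per pass, and union the per-pass join results. This is a toy sketch of the concept only (dicts stand in for Spark broadcast variables, lists for DataFrames; all names are illustrative), not the talk's actual Spark SQL implementation:

```python
from collections import defaultdict

def iterative_broadcast_join(large, medium, num_passes):
    """Toy sketch of the iterative broadcast join.

    Instead of broadcasting the whole `medium` table at once (which may
    not fit in executor memory), split it into `num_passes` chunks by
    key hash, broadcast one chunk per pass, and union the results.
    `large` and `medium` are lists of (key, value) pairs.
    """
    results = []
    for pass_id in range(num_passes):
        # "Broadcast" only the medium-table rows assigned to this pass.
        chunk = defaultdict(list)
        for key, value in medium:
            if hash(key) % num_passes == pass_id:
                chunk[key].append(value)
        # Map-side join of the large, skewed table against the small chunk;
        # no shuffle of `large` is needed, so skewed keys stay spread out.
        for key, value in large:
            for other in chunk.get(key, []):
                results.append((key, value, other))
    return results

large = [("a", 1), ("a", 2), ("b", 3)]   # skewed: "a" is a hot key
medium = [("a", "x"), ("b", "y"), ("c", "z")]
print(sorted(iterative_broadcast_join(large, medium, num_passes=2)))
# [('a', 1, 'x'), ('a', 2, 'x'), ('b', 3, 'y')]
```

Because every medium-table row lands in exactly one pass, the union over all passes produces the same rows as a single full broadcast join, while each pass only holds one chunk in memory.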