Matt loves writing Spark open source code and is the author of the spark-style-guide, spark-daria, quinn, and spark-fast-tests. He’s obsessed with eliminating UDFs from codebases, perfecting method signatures of the public interface, and writing readable tests that execute quickly. Matt spends most of his time in Colombia and Mexico and wants to move to Brazil and learn Portuguese soon. He loves dancing and small talk. In a past life, Matt worked as an economic consultant and passed all three Chartered Financial Analyst exams.
This talk will start by explaining the optimal file format, compression algorithm, and file size for plain vanilla Parquet data lakes. It will then cover the small file problem and how to compact small files. Next, we will talk about partitioning Parquet data lakes on disk and how to examine Spark physical plans when running queries on a partitioned lake.
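As a rough illustration of the compaction idea, the arithmetic can be sketched as below. The 1 GB target file size and the helper name `num_output_files` are assumptions made for this example, not figures or APIs taken from the talk.

```python
import math


def num_output_files(total_bytes: int, target_file_bytes: int = 1024 ** 3) -> int:
    """Return how many files to write so each is roughly target_file_bytes.

    The 1 GB default is an assumed target, not a value from the talk.
    """
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative / positive")
    # Always write at least one file, even for a tiny dataset.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical use with PySpark (paths are placeholders, not run here):
# df = spark.read.parquet("s3a://some-bucket/lake")
# df.repartition(num_output_files(total_bytes)).write.parquet("s3a://some-bucket/lake_compacted")
```

Repartitioning to a computed file count before writing is one common way to compact; the right target size depends on the cluster and the workload.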
We will discuss why it's better to avoid PartitionFilters and directly grab partitions when querying partitioned lakes. We will explain why partitioned lakes tend to have a massive small file problem and why it's hard to compact a partitioned lake. Then we'll move on to Delta lakes and explain how they offer cool features on top of what's available in Parquet. We'll start with Delta 101 best practices and then move on to compacting with the OPTIMIZE command.
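To make the idea of directly grabbing partitions concrete, here is a minimal sketch that builds the directory path for a Hive-style partitioned lake (the `column=value` layout that `partitionBy` writes). The helper `partition_path`, the bucket name, and the `country` and `year` columns are hypothetical.

```python
def partition_path(base_path: str, **partition_values: str) -> str:
    """Build the directory path for one partition of a Hive-style lake.

    Keyword order is kept, so pass partition columns in the same order
    they were given to partitionBy.
    """
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([base_path.rstrip("/")] + parts)


# Hypothetical use with PySpark (not run here):
# path = partition_path("s3a://some-bucket/lake", country="CO")
# df = spark.read.parquet(path)  # reads only that partition's files
#
# The filter-based alternative, which shows up as PartitionFilters
# in the physical plan:
# df = spark.read.parquet("s3a://some-bucket/lake").filter("country = 'CO'")
```

Reading a partition directory directly drops the partition column from the resulting DataFrame unless the `basePath` read option is set, so that trade-off is worth keeping in mind.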
We'll talk about creating a partitioned Delta lake and how OPTIMIZE works on a partitioned lake. Then we'll talk about ZORDER indexes and how to incrementally update lakes with a ZORDER index. We'll finish with a discussion on adding a ZORDER index to a partitioned Delta data lake.
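For reference, the Databricks Delta command has the shape `OPTIMIZE table [WHERE predicate] [ZORDER BY (cols)]`. The sketch below only assembles that statement as a string; the helper `optimize_statement` and the table and column names are made up for illustration.

```python
from typing import Optional, Sequence


def optimize_statement(
    table: str,
    zorder_cols: Sequence[str] = (),
    where: Optional[str] = None,
) -> str:
    """Assemble a Databricks Delta OPTIMIZE statement as a SQL string."""
    sql = f"OPTIMIZE {table}"
    if where:
        # On a partitioned lake, the predicate limits which partitions get compacted.
        sql += f" WHERE {where}"
    if zorder_cols:
        sql += f" ZORDER BY ({', '.join(zorder_cols)})"
    return sql


# Hypothetical use on Databricks (not run here):
# spark.sql(optimize_statement("events", ["event_type"], "date >= '2019-01-01'"))
```

Restricting OPTIMIZE to recently written partitions with a `WHERE` clause is one plausible way to keep a Z-ordered lake up to date incrementally, rather than rewriting the whole table each time.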
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta.
* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren't mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes