This talk will start by explaining the optimal file format, compression algorithm, and file size for plain vanilla Parquet data lakes. We will discuss the small file problem and how you can compact small files. Then we will talk about partitioning Parquet data lakes on disk and how to examine Spark physical plans when running queries on a partitioned lake. We will discuss why it’s better to avoid PartitionFilters and directly grab partitions when querying partitioned lakes. We will explain why partitioned lakes tend to have a massive small file problem and why it’s hard to compact a partitioned lake. Then we’ll move on to Delta lakes and explain how they offer cool features on top of what’s available in Parquet. We’ll start with Delta 101 best practices and then move on to compacting with the OPTIMIZE command. We’ll talk about creating a partitioned Delta lake and how OPTIMIZE works on a partitioned lake. Then we’ll talk about ZORDER indexes and how to incrementally update lakes with a ZORDER index. We’ll finish with a discussion on adding a ZORDER index to a partitioned Delta data lake.
Matt loves writing open source Spark code and is the author of the spark-style-guide, spark-daria, quinn, and spark-fast-tests. He's obsessed with eliminating UDFs from codebases, perfecting method signatures of the public interface, and writing readable tests that execute quickly. Matt spends most of his time in Colombia and Mexico and wants to move to Brazil and learn Portuguese soon. He loves dancing and small talk. In a past life, Matt worked as an economic consultant and passed all three Chartered Financial Analyst exams.