Managing big data stored on ADLS Gen2/Databricks can be challenging. Setting up security, or moving and copying the data of Hive tables or their partitions, can be very slow, especially when dealing with hundreds of thousands of files. Procter & Gamble developed a framework (to be open-sourced before the conference) that takes the performance of these operations to the next level. By leveraging Apache Spark parallelism, low-level file system operations, and multithreading within tasks, we reduced the time needed to manage ADLS files by more than 10x. In addition, any data engineer can now manage ADLS file security without a deep understanding of the ADLS REST API. The framework also gives Apache Spark applications new capabilities, such as moving files, folders, tables, or partitions with a single line of code. This presentation will show the problems we are solving with this framework, as well as previous solutions that did not work well. Next, we will present in detail how these problems were solved using the Spark API and which higher-level methods are available in the framework. We will walk through the available options and the planned extensions to the library.
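To make the idea of combining partition-level parallelism with multithreading inside each task more concrete, here is a minimal sketch. It is not the framework's API: the function names (`chunk`, `copy_partition`, `parallel_copy`) are hypothetical, the local file system stands in for ADLS, and the partitions are processed in a plain loop where a Spark job would hand each one to a separate task (for example through `mapPartitions`).

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def chunk(items, n_chunks):
    """Split items into n_chunks slices (a stand-in for Spark partitions)."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def copy_partition(pairs, threads_per_task=8):
    """Copy one partition's (src, dst) pairs with a thread pool inside the task."""
    def copy_one(pair):
        src, dst = pair
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        return dst

    with ThreadPoolExecutor(max_workers=threads_per_task) as pool:
        return list(pool.map(copy_one, pairs))


def parallel_copy(src_root, dst_root, n_partitions=4, threads_per_task=8):
    """Copy every file under src_root to the same relative path under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    pairs = [
        (p, dst_root / p.relative_to(src_root))
        for p in src_root.rglob("*")
        if p.is_file()
    ]
    copied = []
    # Assumption: on a cluster, each partition would run as its own Spark task;
    # here the partitions are simply processed one after another.
    for part in chunk(pairs, n_partitions):
        copied.extend(copy_partition(part, threads_per_task))
    return copied
```

The two levels matter because file operations are mostly I/O-bound: spreading partitions across executors scales the work across machines, while the threads inside each task keep a core busy when individual storage calls are waiting on the network.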
Procter & Gamble
Lead Solution Architect for Advanced Analytics at Procter & Gamble, with previous experience in traditional data warehousing. Passionate about getting the most out of available tools, solution architecture, and development. Works daily on complex ETL/ML projects on an Azure/Databricks architecture.