Dominique Brezinski is a member of Apple’s Information Security leadership team and a principal engineer working with the Threat Response org. He has twenty-five years of experience in security engineering, with a focus on the design and development of intrusion detection and incident response systems. Dom has been working with Apache Spark in production since the 0.8 release.
Cyber threat detection and response requires demanding workloads over large volumes of log and telemetry data. A few years ago I came to Apple after building such a system at another FAANG company, and my boss asked me to do it again. I learned a lot from my prior experience using Apache Spark and AWS S3 at massive scale: some good patterns, but also some bad patterns and pieces of technology that I wanted to avoid. That year I ran into Michael Armbrust at Spark+AI Summit and described what I wanted to do, along with a plan to test Databricks as a foundation for the new system. A few months later, while we were in the middle of our proof-of-concept build-out on Databricks, Michael gave me some code they were calling Tahoe. It was the early alpha of what became Delta Lake, and it was exactly what we wanted. We have been running our entire system, writing out hundreds of TB of data a day, on Delta Lake since the very beginning.
This presentation will cover some of the issues we encountered and things we have learned about operating very large workloads on Databricks and Delta Lake.
Security monitoring and threat response have diverse processing demands on large volumes of log and telemetry data. Processing requirements span from low-latency stream processing to interactive queries over months of data. To make things more challenging, we must keep the data accessible for a retention window measured in years. Having tackled this problem before in a massive-scale environment using Apache Spark, when it came time to do it again, there were a few things I knew worked and a few wrongs I wanted to right. We approached Databricks with a set of challenges to collaborate on: provide a stable and optimized platform for Unified Analytics that allows our team to focus on value delivery using streaming, SQL, graph, and ML; leverage decoupled storage and compute while delivering high performance over a broad set of workloads; use S3 notifications instead of list operations; remove Hive Metastore from the write path; and approach indexed response times for our more common search cases, without hard-to-scale index maintenance, over our entire retention window. This talk is about the fruit of that collaboration.