Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-Source Spark


Booz Allen is at the forefront of cyber innovation, and sometimes that means applying AI in an on-prem environment because of data sensitivity. As many of our clients want to apply data science in operations, the team at Booz Allen had to find appropriate solutions. Booz Allen’s innovative Cyber AI team will take you through an on-prem implementation of the Databricks Runtime compared to Open-Source Spark, how we were able to get 10x performance gains on real-world cyber workloads, and some of the difficulties of setting up an on-prem, air-gapped solution for data analytics.


Video Transcript

– Hey, hi there. This is Justin Hoffman. I am with Booz Allen Hamilton and I’m coming to you from Texas. So I’m happy to be here and presenting to all of you on Spark vs. Spark, that is, Spark Open-Source vs. Spark DBR, in an on-prem environment. As I said, I’m Justin Hoffman, a senior lead data scientist at Booz Allen Hamilton, and I am going on nine years at Booz Allen. I also want to say a special thanks to the US Air Force for allowing us to collaborate with them and solve real-world hard problems, and a special thanks to David Brooks as well for collaborating with us to solve some of our technical problems as we went through our research. Let’s get started. Booz Allen Hamilton has been solving client problems for over 100 years. We are actually at 27,000 employees now, with a revenue of $7 billion for FY20. What we do at the fundamental level of Booz Allen is consulting services, and we do a lot of technology and a lot of great work for all of our clients to support them in any of their endeavors. You can see that pie chart there; where our team sits is within the national defense section. What I am going to talk to you about today is one of our client problems, where we have been doing research and development in collaboration with them to solve a cyber problem using analytics.

So, five of our capabilities at Booz Allen: as I said, fundamentally we are a consulting firm, founded by Edwin Booz, and we grew from there to add sections like analytics, cyber, digital solutions, and engineering. What I am going to talk about is analytics and how it’s applied to cyber, how we support national defense with cyber analysts, how we do that in an on-prem environment with no internet, in enclave environments, what that looks like, what a difficult challenge that sometimes is, and how Spark can come through for us. So, cyber is a very complex challenge, and the average time from intrusion to detection is about 200 days. That is quite a long time in the big scheme of things, but there’s a reason it takes so long: it is highly complex. There are a lot of data feeds coming from millions of devices. Large corporations have OT, IT, and run-of-the-mill Windows or Linux servers, and all of those are attack surfaces that are opportunities for adversaries to get into your network. Suffice it to say, there’s a lot of data in cyber as well. So whenever the cyber analyst is going through all of these alerts, looking for adversaries in a network, for those things that are anomalies, it actually takes a lot of time and a lot of tradecraft to identify root causes and chase down adversaries in a network. So, one thing that we wanted to focus on as part of our research and development is speed. Speed is very important to an analyst. Obviously, whenever you have 200 days on average that you’re trying to analyze something, or maybe you are a threat hunter who arrives on mission to find a potential adversary or just, you know, lock down an environment, it’s important to have speed, and it’s important to have all of the gear that you need in order to successfully do your job.
So part of our R&D focused on: how do we apply machine learning at scale in an on-prem environment where there is no internet connection? You have some horsepower there on the hardware, but what does that look like, and is it effective? And, by the way, how does an Open-Source version of Spark compare to the Spark DBR version? So that’s where we focused here. There have also been reports out there that some of the nation-state adversaries are gaining initial access to a computer and pivoting to another computer in less than 20 minutes. So not only has it gone from 200 days from intrusion to detection, but now, in some cases, some of the more sophisticated adversaries can do it in 20 minutes. So speed is paramount.

The Challenge: Go Fast… On-Premise?

So, this graphic here is kind of an overview of the data science problem and how Booz Allen looks at the data science process. We have a bunch of data sources from a bunch of different areas of a network. It could be proprietary sources, it could be any data source anywhere: PCAP data, Zeek files, any of those things. What we want to do is collect that data, wrangle it, process it, and aggregate it into things that we can understand in a common data framework, a common data model. Then we can expose that information by either enriching it or applying machine learning, and ultimately it arrives at the cyber analyst’s desk, where ideally they have everything at their fingertips and all of those insights bubble up to the very top, so they can spend the majority of their time on the key things they need to focus on. So this is a higher-level process, but I would say 80%, even 90%, of our time in any data science effort is spent between collection, processing, and aggregation. And whenever you look at doing things on premise, where terabytes of PCAP are coming off of a network, you have to have a data pipeline that can collect that information and process it, and do so rapidly and at scale. And whenever you get to the “expose” bubble of this process, that’s where machine learning takes place, running on top of Spark or on top of a distributed cluster, so that you can take your models from local environments to production scale and hopefully make a huge impact on cyber security.
So this next slide here is the data science framework, the data science process, applied to a cyber problem. Just as I was mentioning, you have data coming in from various sensors on the left, and you have some sort of data broker there towards the middle that is doing the churn of collecting the data, processing it, normalizing it, enriching it, and then putting it into a storage mechanism for later analysis by the analyst.

Solution: A Service-Oriented Architecture for Capability Deployment

So the normalization engine is a methodology where you have a common data framework, a common data model, where any cyber data can fit into some sort of categorization or metadata management of information about the data you’re collecting. The Joint AI Center has done a really great job of figuring out a common data model for this cyber data, and that model is then impactful for doing machine learning and having proper labels for any enrichment. During the enrichment phase, we have various machine learning models, because there is not one model to rule them all, if you will. We apply machine learning to DGA attacks, we can do different random forest models, and we want to apply all of those at scale, with the idea that the output, the probability of that recommendation, will then give the analyst insight on whether or not that particular behavior is an indicator of attack or an indicator of compromise. We also have other threat intel feeds that we like to add into that enrichment engine, where we can take hashes of different files and send them to something like VirusTotal, or any API you can think of, to create a story about all of those endpoints and the potential initial access for an adversary. And then ultimately, after all of that hard work is done, we get down to the analyst, who then has the hard job of going through all the different indicators of compromise, and who hopefully has data that’s been ranked and stacked from top to bottom, so the time they spend goes first to the very highest likelihood of an attack. You can use a bunch of different tools in that kind of methodology. That’s how Booz Allen thinks about these kinds of things. So as far as our research and development, what we wanted to do is go fast.
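To make the DGA point concrete: a common feature in DGA detectors is the Shannon entropy of the domain label, since algorithmically generated names tend toward uniformly random characters while human-registered names reuse common letters. A minimal sketch in plain Python; the function name is illustrative, not the model from the talk:

```python
import math
from collections import Counter

def shannon_entropy(domain: str) -> float:
    """Shannon entropy of the characters in the first domain label.
    Algorithmically generated domains tend to score higher than
    human-chosen ones, so entropy is a cheap first-pass DGA feature."""
    label = domain.split(".")[0].lower()
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("google.com"))          # relatively low
print(shannon_entropy("xj4kq9zv2mwp7r.com"))  # relatively high
```

In practice a feature like this would be one column among many feeding the random forest, computed across the cluster rather than on a single machine.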
So we wanted to figure out how we could leverage Delta Lake and Spark DBR to cut off a lot of the excess, if you will, and prove out that between Spark Open-Source and Spark DBR there are huge optimizations to be gained. And I think that is what we have been successful at. So this next graphic here shows a more stripped-down version of that process: the research and development process of leveraging Spark SQL to find IPs that are of interest.

Project Architecture: Focused on High Performance Computing

We can correlate and gather all sorts of information on that IP using the embedded SQL language. And then under the hood, we have Spark Open-Source vs. Spark DBR, and the big question there was: does it matter, when we move on premise, whether we have Spark Open-Source or Spark DBR? And, in fact, it does matter. Right? Basically, and we’ll get into this later, DBR does provide large optimizations when doing Spark SQL, looking for different IPs and doing complex joins, and we also get advantages whenever we apply machine learning models at scale in an on-premise environment.
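As an illustration of the kind of filtered join described here, the sketch below uses Python’s built-in sqlite3 as a toy stand-in engine; the table and column names are hypothetical, and on the cluster the same query would be issued through Spark SQL:

```python
import sqlite3

# Hypothetical schema: connection logs plus ML-scored alerts, joined on IP.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conn_log (src_ip TEXT, dst_ip TEXT, bytes INTEGER);
CREATE TABLE alerts   (ip TEXT, model TEXT, score REAL);
INSERT INTO conn_log VALUES
  ('10.0.0.5', '8.8.8.8', 120), ('10.0.0.5', '1.2.3.4', 9000),
  ('10.0.0.7', '8.8.4.4', 300);
INSERT INTO alerts VALUES
  ('10.0.0.5', 'dga_rf', 0.91), ('10.0.0.7', 'dga_rf', 0.12);
""")

# Filtered count: traffic rows for IPs the model flagged with high confidence.
row = conn.execute("""
  SELECT a.ip, COUNT(*) AS flows, SUM(c.bytes) AS total_bytes
  FROM conn_log c JOIN alerts a ON c.src_ip = a.ip
  WHERE a.score > 0.8
  GROUP BY a.ip
""").fetchone()
print(row)  # ('10.0.0.5', 2, 9120)
```

The same shape of query, a join filtered by a model score and aggregated per IP, is what grows expensive at billions of records.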

So I’ll talk more at length about Spark, but let’s focus on Delta Lake here for a minute. Initially, when we had done our research, we started with Zeek logs that were coming from PCAP data, raw, real data. We put that into Zeek files, ingested that into Parquet, put the Parquet into Hadoop, and then eventually did the Spark analysis. So that was our pipeline, and when working with Databricks, they put us onto the Delta Lake format and all the optimizations possible there. That really made a lot of sense for us at the data broker stage, because you have six worker nodes and a lot of data coming in; I think we had about a terabyte or more of data. We wanted to make sure that we were squeezing out as much optimization as possible. Delta Lake really provided that: with DBIO caching and MariaDB, we were able to get orders-of-magnitude improvements over the plain Parquet files.

So it’s a little bit more cumbersome to work in an on-premise environment than it is in the cloud, if you will, and this really helps you get there a lot faster. Ethernet cables and gigabit speeds actually matter whenever you are deploying the VMware containers and virtualized environments, allocating memory, and making trade-offs between memory. That’s a high-performance-computing piece that does actually matter when you are doing on-premise kinds of stuff. A lot of that is abstracted away for you in the cloud, so whenever you are running Spark on premise, it really helps to have a lot of that knowledge about the trade-offs on what you can or can’t do. I think we iterated quite a few times on how much memory to give each of the worker nodes and how best to connect things into Hadoop, which was a great learning experience, as all research and development really is.
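The memory trade-offs mentioned here are ultimately expressed through Spark’s resource settings. A sketch of what an on-prem `spark-defaults.conf` might look like for six worker nodes; every value below is illustrative and depends entirely on the actual hardware:

```
# Illustrative spark-defaults.conf sketch for an on-prem cluster.
# One executor per worker node in this example.
spark.executor.instances       6
spark.executor.cores           8
# Leave headroom for the OS and the HDFS daemons on each node.
spark.executor.memory          48g
# Off-heap / container overhead; too little causes silent executor kills.
spark.executor.memoryOverhead  6g
spark.driver.memory            16g
# Roughly a small multiple of total cores (6 nodes * 8 cores here).
spark.sql.shuffle.partitions   192
```

Iterating on numbers like these, and watching which jobs failed, was a large part of the tuning loop described above.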

So kind of moving on, we’ll explore some of the results for Spark Open-Source and Spark DBR. Obviously, in the cloud, at a minimum you can get 5X faster.

Results: Spark Open Source vs Spark DBR

That picture there on the left was taken from Databricks’ own website, where in the cloud, Spark DBR vs. Spark Open-Source on AWS is at least 5X faster. So what does that mean in an on-premise environment, and what does that mean for deploying machine learning at scale on premise? There wasn’t really a whole lot of data out there, at least we felt, so that’s what kicked a lot of this question off: can we do that same thing and get the performance gains that you would see in the cloud, in a more closed-off enclave, on premise?

So as you can see on the graph there on the right, the biggest performance gains were from the SQL filtering and SQL joins on data that had been parsed and had machine learning models applied to it. Taking an IP that was of interest, basically replicating what an analyst would do, and using SQL joins to go find that IP across terabytes and billions of records is no easy task. Right? And the more complex the join got, the more optimization we got. A more rudimentary read-and-count kind of SQL query returned about 4.6X.

But whenever we did a filtered count in SQL, where we are aggregating maybe two different tables, counting, doing things, we even saw a 43X optimization using DBR over the Spark Open-Source version. And that’s groundbreaking to us when doing Spark on premise, because it means that the cyber analyst, taking in all of these billions of alerts coming from millions of devices, can now go find an IP and an adversary threat and get up to a 50X return on optimization if they’re using Spark DBR over Open-Source. So that was quite eye-opening to us and to the clients we support. One of the things I wanted to mention here: for the decision tree, there is not a whole lot of optimization. There are probably better ways we could have coded some of the machine learning pieces, too. There is MLflow, which is part of our future work, along with having user-defined functions executed properly within our own machine learning models, to make sure we can boost those performance gains on DBR even further whenever we are performing machine learning at scale. So if you can see there, at a million records or more, it’s 43X in return if you choose to go with Spark DBR for an on-premise deployment. And it is possible to deploy DBR on premise; you don’t have to necessarily use Open-Source Spark. Now, some of the lessons learned that I wanted to get into.
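On the user-defined-function point: in PySpark, vectorized (pandas) UDFs hand the Python function a whole batch of rows per call rather than one row at a time, which amortizes serialization overhead. A minimal plain-Python sketch of that batched-scoring shape, without the pyspark dependency; the feature and threshold are hypothetical:

```python
from typing import List

def score_batch(entropies: List[float], threshold: float = 3.5) -> List[int]:
    """Score a whole batch of feature values in one call -- the shape a
    vectorized (pandas) UDF takes. One call per batch amortizes the
    Python/JVM serialization cost that a per-row UDF pays on every record."""
    return [1 if e > threshold else 0 for e in entropies]

# One call scores a whole partition's worth of rows.
flags = score_batch([1.9, 3.8, 4.1, 0.0])
print(flags)  # [0, 1, 1, 0]
```

Wrapping logic like this properly for the cluster, rather than calling a model row by row, is the kind of coding improvement the talk flags as future work.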

With Spark DBR and Delta Lake, we obviously saw up to 50X, depending on what kind of join you are doing. That is really important for the analyst chasing an IP of interest.

So whenever we did neural network classification with DBR, we were still able to see a little bit more than 4X.

In future work, when we deploy our neural networks, we’ll make sure that we are doing it in an optimized method. But it was really exciting to see deep learning deployed on premise on Spark, and on real client data. We also experienced some failures from the worker nodes on the Open-Source side.

So initially we thought it was Spark Open-Source that was failing when some of our big data jobs wouldn’t finish, but it turned out it was our distribution of Hadoop. The lesson learned there is to also check your Hadoop distribution, and maybe use a different distribution that is better maintained by an Open-Source community. That way, maybe you won’t experience worker nodes just dying off and not completing jobs. We also found that leveraging the Delta Lake format with Parquet and MariaDB was key, because you definitely get more optimization over any of the RDDs.

And that opens a lot more research for us: how do we ingest data at scale, and how do we do it streaming, to provide the best possible user interface for the cyber analysts and enable our partners to threat hunt effectively? So I look forward to all of your questions, and again, thanks for attending this talk.

About Justin Hoffman

Booz Allen Hamilton

Justin Hoffman is a Senior Lead Data Scientist at Booz Allen Hamilton. He has over 8 years of experience in the analytics field developing custom solutions and 13 years of experience in the US Army. Mr. Hoffman currently leads an internal R&D project for Booz Allen in the field of applied Artificial Intelligence for Cybersecurity. He holds a B.S. in Mechanical Engineering from UTSA, multiple certifications, and recently completed 3 journal papers in Deep Learning applied to the fields of steganography and GANs. In addition, Mr. Hoffman currently has 1 patent in Biomedical Analytics for an electrolytic biosensor and 2.