MIDAS: Microcluster-Based Detector of Anomalies in Edge Streams

Download Slides

Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? Existing approaches aim to detect individually surprising edges. In this work, we propose MIDAS, which focuses on detecting microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, such as lockstep behavior, including denial of service attacks in network traffic data. MIDAS has the following properties:
(a) it detects microcluster anomalies while providing theoretical guarantees about its false positive probability;
(b) it is online, thus processing each edge in constant time and constant memory, and also processes the data 162 – 644 times faster than state-of-the-art approaches;
(c) it provides 42%-48% higher accuracy (in terms of AUC) than state-of-the-art approaches.

MIDAS finds anomalies or malicious entities in time-evolving graphs. MIDAS can be used to detect intrusions, Denial of Service (DoS), and Distributed Denial of Service (DDoS) attacks. It can also be used to detect fake profiles in Social Networks like Twitter, Facebook, Amazon reviews, and Financial Frauds. MIDAS requires constant memory to detect these anomalies in real-time so as to minimize the harm caused by them. MIDAS is currently being deployed in real-world systems to improve their performance. Different cybersecurity firms have also asked us to tune MIDAS according to their requirements. Different developers have implemented MIDAS in Python, Ruby, Rust, R, and Golang in addition to the C++ version we originally released.

Speaker: Siddharth Bhatia

Transcript

– Hi everyone. So, this is MIDAS. A micro cluster based detector of anomalies in edge streams. So, this work was published at AAAI this year, and it is joint work with my advisor, Brian and Minji and Christos from Carnegie Mellon University, and Kijung is from KAIST. So, let’s proceed. We all know that a lot of, DoS, DDoS kind of attacks intrusions, and these are happening even in big corporations like GitHub and, there was a Twitter attack recently, several Bitcoins got stolen. So I know it’s happening with even high profile users. So, even though Twitter, GitHub, these are some of the most secure companies out there still such kind of attacks are happening. And, they’re very easy to detect because even these are actually bot kind of attacks. So they’re very easy to detect even from a human user. So, machine learning should definitely be able to do it much better. So, that is the motivating application behind this and for intrusion detection, we can just see this as an example where there are three anomalous users who came in and they wanted to attack, all these normal users on the top. And they wanted to, they sent a lot of spurious or suspicious edges and communications. So, we want to detect this kind of sudden behavior where a similar group of malicious, or anomalous entities come in and they are throwing suspicious ideas or communications and this is the scenario which we want to detect. So, we are working with time evolving graphs, time evolving graphs are everywhere like computer networks, or your instant messaging kind of networks like Twitter, Facebook and you have a user product kind of graph for Amazon and other things. So wherever there is a time evolving or a dynamic graph, and you want to detect anomalies there, or the malicious entities, which are acting abnormally. So all those kinds of things, when you want to detect, so MIDAS can be applicable there. And, so, the scenario which we’re considering is an edge stream, I’ll go into the details of what are the different possibilities, after a little while, but an edge stream is when you are receiving an edge at each time instant, and you’re gonna receive multiple edges also at the same time instant, but modeling the same graph, is the same example as an edge stream. So we have, you want to use this as a normal, or the good people kind of users. And A1 to A3 are the anomalies or the malicious entities. And we observed that A1 to A3 come in a very short span of time. For this particular example, they arrive at the same time instant as six, but they didn’t arrive in a very short span of time and we want to detect this kind of anomalous behavior. And we want to do it in real time as soon as possible. So, because the lag between their arriving and their detection. So that is when the most of the harm is done, for example, in the Twitter attack, so most of the money got stolen because this couldn’t be detected very soon. So if such something like this can be detected as soon as possible and even better if it can be detected in real time. So that is amazing, and that is what our goal is in this particular work. And so the roadmap of the presentation, we first formalized the problem, the problem setting, what we were wanting to do, then I discussed the two algorithms of the approaches, which we have in the paper and how they related, what spans in this area and the experiments and how we are kind of contributing this and how the future is going for this particular open source project and what happened in the community. So, we’ll discuss that. So the problem setting is that input is an edge stream “E” and the graph is a time evolving graph. So that is the input and the graph can be directed. It is multi graph so there can be multiple edges between two nodes at the same time instant as well. And it is, we’re handling discrete time instants so that, we can know that, okay this search arrived at this particular time instant, and we’ll be giving as an output an anomaly score for each edge. So for each edge, like in the graph, we’ll be giving an anomaly score there and then, so that is our, what our goal is. So the contributions for this particular work, well, that it detects micro clusters. It gives guarantees on the false positives, like what are the false positive probabilities, So for critical applications, this is quite crucial because you want a guarantee on your false positives. You do not really want those to extend beyond a certain limit and any streaming or an online algorithm, we want the memory to be constant, constant update time and constant memory is really exact, like a necessity for a streaming algorithm, because we cannot afford unbounded memory in a streaming algorithm because we don’t know how many nodes will keep on coming later. So we want to update time and our memory to be constant for each edge. So yeah, let’s go into the algorithms. And so MIDAS, firstly, to just briefly discuss the background so, for this particular case, there can be a lot of time series based methods. So in an offline scenario, like this problem has been quite well explored, but in an online scenario, the memory is bounded and the set of loads is also not fixed. So, we want something else and the naive solution for this particular problem is that we can assume that our entire distribution is Gaussian. So if we assume that the entire distribution list of the past is Gaussian distribution. So for that, we’ll just compute that likelihood at this particular point at the current time instead and, then we just manually see, okay, if the likelihood is less than the threshold, we declare it as an anomaly. The problem with this particular assumption is that we are assuming that our entire past is following a Gaussian distribution. So that is not what happens in real life. So you cannot really make that strong an assumption. So we make a weaker assumption where the mean level of the past is the same as the mean level of the current time. So we assume, suppose you have 1000 edges and 10 time instances, so we just divided in the normal ratio 9:1 which is 900:100. So our past is 900 edges and our current is 100 edges. That is our expected value and you observe, zero and a thousand. So we can easily see like something like this is an anomaly. So if you carefully see the graph, intuitively if you just want to understand this problem, so the difference between the occurrences and the expected regular. So if your occurrences is really huge and the expected value was very low. So, the anomaly score should be proportional to something like that. And if your occurrence is kind of reduced, so, this is our occurrences and this is our expected value the difference as it increases, so we can see that, an anomaly score should be proportional to that and should also increase. So for MIDAS, we used two things. The first is a streaming data structure. So we use sketches for this. Those sketches are really essential for almost all streaming algorithms and all streaming or online algorithms. Sketches, basically bring down a high level ratio space down to a lower dimension and count-min sketch is one of the most popular sketch data structures. So we use two type of CMS or count-min sketch data structures. So in the first, in the previous libraries, we saw that we needed two things. So, it first was an observed value, like getting an approximate observed value and an expected value. So expected, is basically our total divided by our total time. So that is how you get a mean kind of value and the observed will directly be stored in another data structure. So, we see that “s” is the approximate total count and “a” will be storing the approximate current count, which is the observed count and sketches basically give us theoretical guarantees. So when we have the stay hard, which is less than or equal to a + mu, and this, don’t worry about the details. So what it intuitively says is that with probability, at least 1 – epsilon, we cannot tolerate up to mu amount of error. So something like with 99% probability, it will to be able to tolerate around 0.5% error. So something like that, like a theoretical guarantee, we get with using sketches and there’s an Apache data sketches library, which is a really nice and open source implementation. So they’ve been doing a lot of good work in this area, so feel free to check out sketches. And the other thing which we use in MIDAS is combining these sketches with a chi-squared test. So chi-squared test is basically just a, it is a very popular statistic and views a summation of both the categories, the past and the present and intuitively, so this is a score which we use in MIDAS the direction is in the paper. For implementations, I don’t think it’s required, but intuitively, we can just see that “a” was the observed, and “s” was the total. Expecting becomes the expected, something like that, and our score is proportional to that so that can convince you. Slick. So for details, you can definitely feel free to check out the proof and other things. This is the final score which will be taken into consideration and evaluated based on that. And so using these two things, this was MIDAS. So then we incorporate relations in MIDAS, which is the MIDAS-R algorithm. So, for incorporating relations, we incorporate two kinds of relations; the temporal and the spatial relations. So temporal relations allow us the past, also to be incorporated into the current batch. So previously we saw we were just considering the current time instant. So instead of that, we are kind of considering the previous times as well with a diminishing rate. For example, if your current time was ten, time equal to ten, so, we also considered time equal seven, eight, nine, ten, something like that in a temporal relation with the reducing factors. So you consider, time equal ten to be like 100%, time equal to nine, maybe say around 50%, time equal to eight, around 25%, so we diminish it by the same factor. Spatial relations in this, instead of just finding numbers, they just like just AUV, SUV, those were the data structures we were using. So instead of that, Now, we additionally have SU, AU and SV, AV, So, something, for source node and for the destination node. So we’ll be able to see, not just the count between two edges, but rather a count coming out from a source node, but count averages coming into a destination node. So based on that, then we’ll finally take a maximum of all the three scores and give that as the score of the anomaly, the anomaly score of the edge. So after MIDAS-R, so we also discussed the time and memory complexity of both the algorithms. Like I mentioned, streaming or online algorithms need to have a constant space and a constant update time. And because we just do sketches which give a lot of guarantees as well, so they are the W cross D space complexity and the time complexity is not ready. So, W cross D because we have D-hash functions, where every value is hashed D times that is the number of times it’s hashed, and it is hashed into DW buckets. So we do some operations over the entire matrix, and that is how to order W cross D. So now we are discussing briefly about the related work in this area. So anomaly detection in graphs is divided into static graphs and dynamic graphs. So for static graphs, like for all the categories, there’s an anomalous load detection, anomalous edge detection, an anomalous sub graph detection. So you can either detect nodes, which are not, which nodes are anomalous, which edges are anomalous, which sub-graphs are anomalous, something like that. There’s been a lot of work previously in static graphs, but I think over the past few years, people have been trying to conduct something like this in dynamic or time dependent graphs and for dynamic graphs, there can be two kind of inputs. Where you can have a graph stream, where you are written a new graph at each time instant or you can have an edge stream. You connect a new edge at each time instant. So an edge stream is at finer granularity, and we were doing, detecting anomalies in this. So within edge streams, our data basically lies in the edge detection category, and these were the previous state of the art baselines, SedanSpot, RHSS, and we compare with those in the experiments. So we also see that these two baselines, they were also streaming algorithms and online algorithms. We’ll see the performance aspects in a couple of minutes. But MIDAS detects micro clusters and it gives guarantees on the false positives with these approaches. So now the experiments. We see that we were basically using time evolving graphs, and any type of graph can be used, but we use DARPA which is very popular intrusion detection dataset. So this is the conference version, it was published on www.arxiv.org recently, a journal extension of this particular paper, areas like a few more baselines. Like a few more datasets. So for Twitter, we have TwitterSecurity and TwitterWorldCup. So these are all tweet related graphs, but again, like I mentioned, we can use any kind of graphs, any kind of time evolving graphs. So just a small note, here like for Twitter datasets, like these are publicly available Twitter datasets, for these that were basically, extracted during these months in 2014 for security, like security related events, like there was a bomb blast or some other Ebola virus outbreak during that time. And from the World Cup, like the red card, like a yellow card, a goal is scored, a penalty was given, something like that, so those kinds of tweets, and we just saw how our metric learning baselines could compare with that. So some of these experiments are, mostly slides are from DARPA dataset. So we see that with SedanSpot worked on a 64% as the AUC whereas MIDAS scored around 91%, MIDAS-R slightly improved upon MIDAS and it incorporated relations. So that went up to 93%. So the running times are for MIDAS, MIDAS-R and we observed that MIDAS-R takes slightly more time than MIDAS because it incorporates relations, but even then it runs within less than a second. So that buys around 4.5 million edges. So this is on my normal MacBook. Like it runs within a second, whereas SedanSpot takes up to two orders of, at least two orders of magnitude more time. And we also see in the next journal extension, so we also see that our implementations are slightly faster and we are at least three orders of magnitude faster than the previous baselines. Accuracy vs. time. So this graph basically will be, we see that MIDAS and MIDAS-R, they’re around two orders of magnitude faster, and they’re performing up to 50%, getting 50% more accuracy compared to the baselines. So that is just a previous scale of graph itself. And scalability, so we see that MIDAS, MIDAS-R, they’re kind of scalable. And with increasing number of edges, the slope is, the slope is only one line there with lesser than almost, that is also linearly scalable. And so then the real world effectiveness, so here we were concerting the Twitter graphs. So on the right side, you see that these are all the Twitter, secondary data related events, like there was a suicide bombing and Gaza conflict, Ebola virus outbreak, and so along with that, these are, so we brought in MIDAS-R and SedanSpot because of lack of space, but the peaks correspond to all these secondary data related events which happened in 2014. So, for example, sixth peak is actually Mpeketoni attacks, so this is not plotted using the data, so this is an unsupervising . So we linked, this is how Midas, how when MIDAS was correlating with the ground proof events. Whereas if you see, SedanSpot graph window so that is actually giving a high anomaly score for almost everything. So that is not really useful. We want a high anomaly score to correspond quite well to events, and we see that MIDAS will do that. So if you went into the details of this event, two and seven, so we saw that two and seven got detected by SedanSpot, it gave a low anomaly score. Then we kind of analyzed it in more detail and we saw that this was actually a micro cluster kind of anomaly, something like this, which was detected by calculating the relations. So that was a low-priority event that SedanSpot missed, and I think even MIDAS missed it, but because of that we incorporate the relations and we can see that. So MStream is the next extension of MIDAS. So in MIDAS, we were just concerning two features like source IP and the destination IP, now we’ve extended it to multiple dimensions where we have a higher number of dimensions and we have average packet size, correlation and these kinds of things. So, it can be basically anomalies, earlier considered and detected into source and destination. Here it can be detected in event log data, record settings, or when the attributed graphs, when each edge has multiple attributes, the challenges for this work and stream of that, now we have to analyze real value features as well, because if you analyze this table, so 1000 and 995 have to be kind of similar, whereas 100 and 80 would be similar, they were all categorical features. And then we also did some extra inhibited in this, which part is anomalous and what is causing the anomaly and the correlation between the different features. And we also need to have an added restriction here, that it has to be constant memory and time, both with respect to the stream length, like previously, and in addition to that with respect to the dimensionality as well. So, MStream, we’ve uploaded the paper on Arxiv and the code is also available on GitHub, so feel free to check it out. So after MStream, there was a lot of discussion over data where the repository became quite popular, people have been using it. So I definitely encourage anyone who wants to do that. And filtering MIDAS, the first one, we’ve almost completed it but that has come out of our GitHub users and because, I won’t say a flaw, but one limitation of MIDAS was that we never really use our own detections to improve the environment for the- So once we’ve detected something even better, if we can kind of filter that anomaly out so that it doesn’t negatively affect the data structure. So that is what filtering MIDAS does and we saw that the accuracy jumped to 99% with that instead of 95%. So the code is uploaded on GitHub and it’s updated, but the paper is another paper, so you can check out my webpage for that. We can also have a parallel distributed or a GPU based version from that, because, as we saw, MIDAS gained a lot of speed ups and having a parallel implementation then you will detect anomalies in parallel over several servers. That should be quite important for several applications and companies like Databricks and Google and all these kinds of companies running on multiple servers, you definitely want parallel implementation. So a hardware implementation could also be quite important similarly, and then that could be an interesting search over how it performs on knowledge graphs, and NLPs like this, so I haven’t tried that out with texture kind of graph data. But that should be interesting. Someone rush to do that. A periodic setting, so we haven’t yet considered that. For example, if you are buying something on Amazon, so you would definitely expect that your average or your expected increases over, say Christmas, on Black Friday, something like that. So you cannot have a constant kind of thing. So having a periodic setting and basically checking that out and making adjustments according to that, we haven’t done that yet. Apart from that, so my original interpretation was in C++, so in addition to that, people have been implementing it in Python, Java, all we ask, I think maybe there was several other languages, but Erlang and F# answer them, but I think there are a lot of different languages, so, you can either contribute on GitHub or check on the project and feel free to get in touch if you are remotely interested in anomaly detection, anomaly detection. And if you want to use it and try it in your settings, so we have made several other additions to this, we’re trying to incorporate deep learning and make several aspects there, because streaming algorithms usually, they’re kind of considered separate from deep learning, but we’re trying to see if we can make a one pass, I really like a small label data, like a small set of label data and can we use it to further enhance the performance. So definitely, feel free to get in touch if you want to collaborate in any depth. With that, I conclude. So on the data repository link is towards the right, the paper link is towards the bottom and with MIDAS we are detecting streaming microclusters in a streaming manner. And we give theoretical guarantees on the false positives. Also MIDAS was shown to be effective over the baselines, in a quite convincing manner and, with that, I will take any questions you all have. Thank you.


 
Watch more Data + AI sessions here
or
Try Databricks for free
« back
About Siddharth Bhatia

National University of Singapore

Siddharth is a Ph.D. student at National University of Singapore. His research is supported by a Presidents Graduate Fellowship and he was recognized as a Young Researcher by the ACM Heidelberg Laureate Forum. His research in Anomaly Detection is used to detect intrusions, DoS, and DDoS attacks in an online manner. It can also be used to detect fake profiles in Social Networks like Twitter, Facebook, Amazon reviews, and Financial Frauds. MIDAS requires constant memory to detect these anomalies in real-time so as to minimize the harm caused by them. He is working with Amazon and applying his research.