Detection of abusive activity on a large social network is an adversarial challenge with quickly evolving behavior patterns and imperfect ground truth labels. Additionally, there is often negligible signal from an individual fake account before it is used for abuse. These characteristics limit the use of supervised learning techniques, but they can be overcome using unsupervised methods. To address these challenges, we created a Scala/Spark implementation of the isolation forest unsupervised outlier detection algorithm using the Spark ML API; we recently open sourced this library (https://github.com/linkedin/isolation-forest). We also developed a Scala/Spark unsupervised fuzzy clustering library to identify groups of abusive accounts with similar behavior. Key takeaways: In this talk, we describe the details of these unsupervised algorithms, why this type of machine learning is well suited to the specific challenges associated with fighting abuse, and how we use these models to detect accounts (fake, compromised, or real) that are engaging in abusive behavior such as automation and fake engagement at LinkedIn. We will also discuss best practices and lessons learned from our experience using these techniques in production.
– Hello, my name is James Verbus, and Grace and I are gonna be talking about how we use unsupervised learning to prevent and detect abuse at LinkedIn.
So there are several unique challenges in the anti-abuse space where we think unsupervised learning can be a good solution. The first is labels. Often, there are few or no ground truth labels that we can use to train supervised models or evaluate the performance of our models. Unsupervised learning is a natural solution for this situation. The second challenge is signal. Often individual abusive accounts have very little signal until they start to become abusive, or even if they are abusive, the volume of abuse is so low that it can blend in with normal organic user behavior. In this case, you wanna look at many accounts that are all acting similarly to be able to get enough signal to be confident about predicting them as abusive.
The third challenge is that it can be very adversarial. As we try to stop types of abuse, attackers are quick to adapt and evolve to circumvent our defenses. So even if we are able to obtain some labels today, they may not be representative of what abuse looks like tomorrow.
Again, unsupervised learning is naturally suited for this situation, because as long as the attacker behavior is different from normal user behavior, we should be able to detect it.
So I’m gonna talk about the first of two types of unsupervised learning that we use at LinkedIn. In particular, I’m gonna talk about isolation forests.
So isolation forests were first proposed in 2008, and they consist of an ensemble of randomly created binary trees. To create one of these trees, we take a sample of training data, and then at each node, randomly choose a feature and randomly choose a split value between the min and max value of that feature.
We repeat this recursively until, in principle, all of the data is isolated in leaf nodes, hence “isolation tree.” In practice, we don’t have to build the trees so deeply; we can apply a height limit. The intuition is that outliers are easier to isolate via this random process and show up with shorter path lengths from root node to leaf node, while inliers are harder to isolate and have longer path lengths on average from root node to leaf node. This tree structure captures the multidimensional feature distribution of your training data, and we can score future instances by tracing their paths through the trees; the score is directly derived from the path length. And again, it’s an ensemble of these trees, and we average the results to reduce the variance of the model.
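The tree-growing and path-tracing procedure described above can be sketched in a few lines of Python. (The production implementation is Scala/Spark; this stdlib-only sketch omits per-tree subsampling and the leaf-size path-length adjustment, and all names are hypothetical.)

```python
import random

def grow_tree(points, depth=0, height_limit=10):
    """Recursively build an isolation tree from a list of feature vectors.

    At each node we pick a random feature and a random split value between
    that feature's min and max, then recurse on the two partitions.
    """
    # Stop when the data is isolated, or the height limit is reached.
    if len(points) <= 1 or depth >= height_limit:
        return {"size": len(points)}
    feature = random.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:  # cannot split on a constant feature
        return {"size": len(points)}
    split = random.uniform(lo, hi)
    return {
        "feature": feature,
        "split": split,
        "left": grow_tree([p for p in points if p[feature] < split],
                          depth + 1, height_limit),
        "right": grow_tree([p for p in points if p[feature] >= split],
                           depth + 1, height_limit),
    }

def path_length(point, node, depth=0):
    """Trace a point through one tree; shorter paths suggest outliers."""
    if "split" not in node:
        return depth
    child = node["left"] if point[node["feature"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)
```

Averaging `path_length` over an ensemble of such trees gives the quantity the talk refers to: an outlying point is separated from the bulk of the data in far fewer random splits than an inlying one.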
So the plot on the right shows a toy example of a two-dimensional feature space where one inlying data point is highlighted as the red star. If we apply the isolation forest to this feature space, it would look something like this. We’d first select a feature at random, in this case it happened to be the x-axis, and a random split value. We then repeat this process recursively until the red star is isolated. For this third split, the y-axis happened to be chosen as the random feature. And again, we just continue to do this until the red star is completely isolated in the plot. At this point, the red star is completely isolated, and it took 11 splits to do this. And you can see that this is different from the alternative case, where we choose an outlying data point, highlighted with the red star here. In this case, we start with the same split for the toy example and continue the process until it is isolated.
And it only took five splits in this case. And you can imagine on average, this will be the case where outlying data points take fewer splits than inlying data points.
So there are a lot of advantages to this type of technique. The first is it’s performant. A lot of recent benchmarks and reviews of outlier detection algorithms show that it performs better than the alternatives, and in particular it performs well as you continue to add more and more features to the model. Second, it’s scalable. This technique has low computational complexity and low memory complexity, so you can train and score large volumes of data almost interactively.
Third, there are fewer assumptions than alternative techniques. It’s nonparametric; we’re not making assumptions about the distribution of the data. And unlike nearest-neighbor-based techniques, we’re not assuming some distance metric. And fourth, it’s increasingly widely used. There’s active academic research, with recent papers that have various extensions and improvements to the base algorithm, and there’s also active industry use, in particular in the anti-abuse and trust domain.
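For completeness, the averaged path lengths are conventionally converted into a score between 0 and 1 using the normalization from the original 2008 paper: s(x, n) = 2^(−E(h(x))/c(n)), where c(n) is the average path length of an unsuccessful binary-search-tree search on n points. A minimal sketch:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search on n points;
    used to normalize path lengths across sample sizes."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # harmonic number approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Scores near 1 indicate outliers; scores well below 0.5 indicate inliers."""
    return 2.0 ** (-mean_path_length / c(n))
```

By construction, a point whose average path length equals c(n) scores exactly 0.5, and shorter paths push the score toward 1.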
So I’m gonna give you a flavor of how we’re using it at LinkedIn to detect abusive automation. This plot shows all active members on a particular day at LinkedIn. Each blue dot is a member.
The y-axis is the isolation forest score, so higher is more outlier-like and means more likely to be using automation, whereas the x-axis is the number of user actions of interest. So the more activity you are doing on the site, the farther to the right you’ll be, and a very active member that is using automation would show up in the top right. You can see it’s low density in this region, but there’s a lot of interesting structure, and we can explore a bit more of that now. The cluster that I’ve highlighted in yellow with high scores turns out to be, upon further inspection, a cluster of real members that are using automation tools with similar behavior.
And we can do a deep dive into these accounts, and that looks something like this. These plots are for two different members; each one is a single member from the cluster in yellow on the previous slide, and the y-axis for both plots is automated user actions versus time.
In the top plot, you can see this particular member did a burst of about 30 user actions at a constant rate, then waited for some period of time, then had another burst of roughly 30, then waited, and then did another burst of 30. Likewise, the second member, the one in the bottom plot, did similar bursts, but maybe each burst was a bit smaller and they did more bursts in the day; overall, the total volume was similar. And you can see the automation pattern is similar for both of these accounts. This is why both accounts were identified as having high isolation forest scores and likely to be using automation, because this behavior is very different from what you’d expect from a normal organic user on the site. So returning to our earlier plot of normal user activity on a day, you may be asking, what does it look like if there’s a coordinated fake account automation attack, where each account is controlled by the same actor, using the same scripts in the background? That looks something like this. You can see the structure in the bottom of the plot looks almost identical, but on this future day where there was an attack, we see this very clear dark cluster with a very high isolation forest score that’s very apparent. And this turns out to be, upon further inspection, a fake account attack where all the accounts are using the same script. It’s interesting for a few reasons. One, the accounts all get a very high score, and a very similar score; they’re all at very similar values along the y-axis, which means all these accounts were acting very similarly.
That holds even though the volume of activity had an order of magnitude of variation between the left side and the right side of the red cluster. The second thing to point out is that the volume of activity overlaps significantly with normal users; the red cluster along the x-axis overlaps with the blue bulk of normal users. But the model is able to separate them out along the y-axis, because the behavior is automated and very different from normal users.
Similarly to the previous plot, here are two example accounts from that red cluster. Again, it’s automated user actions versus time. In these cases, each account wasn’t doing very much activity, only 30 to 60 user actions in a day, but it was pretty adversarial: they were randomizing the time between each request to try to blend in. Nonetheless, the model was able to identify that this activity is very different from normal users and give it a high score, and it identified that this activity is very similar for the accounts within that red cluster, and that’s why they all got very similar scores.
So I’ve talked a bit about how we use this for automation detection at LinkedIn, but there are a lot of other potential use cases for this type of outlier detection. It can be useful to detect very sophisticated fake accounts for which you have no labels, or advanced persistent threats, and then surface these accounts for human review. Similarly, it can be used to detect insider threats at your company, another case where you often don’t have any labels, or network intrusions into your company’s network. It can also be used to detect anomalous feature distributions, so you could create ML health assurance systems where, if you see an unusual increase in outliers in your features, you get alerted. It can be used to detect account takeover, so bad actors that take over the accounts of normal users, based on unusual login or post-login activity. It can be used to alert on time series data; a canonical use case is using it to detect payment fraud. And finally, there’s some literature showing that it can be used to detect machines in data centers that are performing poorly, based on CPU, memory, network usage, et cetera.
So it worked so well at LinkedIn that we decided to open source it and we open sourced it last autumn.
So we took our Scala/Spark implementation that supports distributed training and scoring and put it on GitHub. It’s compatible with spark.ml, so you can take this model and put it into your existing spark.ml pipelines. And we have artifacts available in both JCenter and Maven Central, so it’s as easy as just declaring the dependency and pointing at those artifacts if you wanna use it. So if you’re interested, check it out at github.com/linkedin/isolation-forest. And there’s also a LinkedIn engineering blog post linked on this slide as well, where we go into a bit more detail on how we’re using it and what some of the advantages are.
So I’d like to acknowledge a few individuals that had a big impact on the initial part of this work. Frank and Ahmed provided a lot of advice and code reviews, and Will, Jason, Milinda, and Xin provided useful feedback as the first users of this library in various use cases.
So at this point, I’m gonna hand it off to my colleague, Grace, who’s gonna talk about a completely different type of unsupervised learning that we use at LinkedIn. – Thanks, James. So when we think of unsupervised machine learning, I think oftentimes you think of clustering; it’s one of the first topics you learn in unsupervised ML. Clustering is something we’ve been doing at LinkedIn for a little bit now, and we started with something simple, which was group-by clustering.
So here, the goal was to find groups of accounts that had something shared in common, exactly identical. An example would be all the accounts created on the same IP address on the same day. When we group these accounts and score them with a model, we are able to find clusters that look unusually similar to one another, and so we deem those as fake accounts. This technique is computationally very cheap, the group-by operation is fast, and it’s also very effective. But as we started rolling out these models into production, the bad actors started to evolve, and so we needed to up the sophistication in terms of how we decided to form clusters. So we moved to what we’re calling graph clustering. In graph clustering, we’re still looking for groups of accounts, but the key difference is that we’re looking for accounts that have something that is similar across all of them. And of course, if we’re talking about similarity, it’s a lot more expensive, because you need to compare each member in your member base to every other member to determine if they’re similar or not. But what I’d like to show today is that it is tractable with some optimizations, and when you do roll out these types of techniques, they are capable of detecting fake accounts that simpler approaches such as group-by clustering cannot detect.
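The group-by approach described above amounts to keying accounts on an exactly shared attribute. A minimal Python sketch (field names like `ip` and `signup_day` are hypothetical, chosen only to mirror the "same IP address on the same day" example):

```python
from collections import defaultdict

def group_by_key(accounts, key_fn):
    """Cluster accounts that share an identical key, e.g. (signup IP, signup day)."""
    clusters = defaultdict(list)
    for account in accounts:
        clusters[key_fn(account)].append(account["id"])
    # Only groups of more than one account are interesting to score or review.
    return {key: ids for key, ids in clusters.items() if len(ids) > 1}

accounts = [
    {"id": "a1", "ip": "203.0.113.7", "signup_day": "2020-05-01"},
    {"id": "a2", "ip": "203.0.113.7", "signup_day": "2020-05-01"},
    {"id": "a3", "ip": "198.51.100.2", "signup_day": "2020-05-01"},
]
clusters = group_by_key(accounts, lambda a: (a["ip"], a["signup_day"]))
```

Because it is a single grouping pass rather than pairwise comparison, this scales linearly, which is the cheapness the talk contrasts with graph clustering.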
And so with graph clustering, the very first step is that you need to figure out which members are very highly similar to one another. For today, we’re going to focus on the topic of content engagement: we want to find members who are very similar in terms of how they engage with content. The underlying abuse problem that we’re looking for is fake engagement on the platform. In this schematic, I’m showing each member as a circle, and we have some information about them, namely the content that they’ve engaged with, or liked, on the platform. So the very first step we need to figure out is which comparisons to perform. Simplistically, we could say, okay, let’s just compare all members against all other members, but that job’s gonna take a really long time and probably crash. So we need to be a little smarter about this process. If we think of our data, we have members as rows, and then we have the content that they engaged with as columns, and what we can do is perform some filtering. Row-wise, we can remove all the members that haven’t engaged with much content; there’s not really much risk of fake engagement there. At the column level, we can filter out content that is essentially viral on the platform, because if the content is viral, well, everyone interacted with it. There’s not really much information there, so you can remove it and not really suffer any performance issues. So row-wise and column-wise trimming, very important. And then there’s one final criterion in terms of determining who we should compare with whom: we need to make sure that when we compare two members, they at least have some overlap in terms of the content that they engage with.
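The three pruning rules above (drop low-activity members, drop viral content, only compare members with overlap) can be sketched with an inverted index, so pairs without any shared content are never even generated. (This is a conceptual sketch, not LinkedIn's pipeline; the thresholds are illustrative.)

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(engagements, min_actions=3, max_item_popularity=100):
    """engagements: dict of member_id -> set of content ids they engaged with.

    Row-wise filter: drop members with too little activity.
    Column-wise filter: drop near-viral content that carries little signal.
    Overlap rule: only pair members who share at least one remaining item.
    """
    # Column-wise: count how many members touched each item; drop viral items.
    popularity = defaultdict(int)
    for items in engagements.values():
        for item in items:
            popularity[item] += 1
    keep_item = {i for i, n in popularity.items() if n <= max_item_popularity}

    # Row-wise: keep only sufficiently active members, restricted to kept items.
    filtered = {m: items & keep_item for m, items in engagements.items()}
    filtered = {m: items for m, items in filtered.items() if len(items) >= min_actions}

    # Inverted index item -> members; members sharing an item become candidates.
    by_item = defaultdict(set)
    for member, items in filtered.items():
        for item in items:
            by_item[item].add(member)
    pairs = set()
    for members in by_item.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Only the pairs returned here would go on to the (more expensive) similarity computation, which is what makes the all-pairs problem tractable.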
And so using these conditions, we can avoid most comparisons, and on this schematic, the dotted lines are showing the comparisons that we would actually make. Once we know the comparisons, the next step is: what is the similarity metric? We are using Jaccard, or actually Ruzicka. Jaccard is basically the intersection over the union; it works very well when you have zeros and ones, so basically yes or no, did a member engage with a piece of content? We’re using the Ruzicka version, which allows you to work with real positive values, and the reason we have that instead of zeros and ones is because we actually perform a TF-IDF type of weighting on the data points. So basically, content that is more popular contributes less to the similarity score, and content that is less popular, less engaged with, contributes more. Then, once we know the comparisons to run, we need a threshold: what exactly is considered high similarity? In this situation, we use a value of 0.7. It took a few iterations to get to this, but there is some gut intuition here, and basically it depends on how easy it is for two members to just spuriously have high similarity. If we’re using something like IP address, where it’s a lot harder for two members to just have the same IP addresses in common, then you can use a much lower threshold. But for something like content, it’s very easy for two members to engage with the same content, and so that threshold is fairly high at 0.7. So once we know the comparisons to run, we have our similarity metric, we have a threshold, we can go ahead and make those computations and draw in those edges, and that’s what I’m showing here on this schematic. Now, you may look at this and think we already have clusters: you could run connected component analysis, and those connected components would be your clusters. But if you look at the schematic a little more carefully, pay attention to the edges of those clusters.
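The Ruzicka similarity on IDF-style weighted engagement can be sketched directly from the description above: with binary engagement and a per-item weight, the min of the two vectors at an item is the weight when both members engaged with it, and the max is the weight when either did. (A hedged sketch; the exact weighting used in production is not specified in the talk.)

```python
import math

def idf_weights(engagements):
    """Down-weight popular content, TF-IDF style: weight = log(N / popularity)."""
    n_members = len(engagements)
    popularity = {}
    for items in engagements.values():
        for item in items:
            popularity[item] = popularity.get(item, 0) + 1
    return {item: math.log(n_members / count) for item, count in popularity.items()}

def ruzicka(items_a, items_b, weights):
    """Weighted Jaccard (Ruzicka): sum of elementwise mins over sum of maxes."""
    numerator = sum(weights.get(i, 0.0) for i in items_a & items_b)
    denominator = sum(weights.get(i, 0.0) for i in items_a | items_b)
    return numerator / denominator if denominator else 0.0
```

With all weights equal to 1 this reduces to plain Jaccard, and a pair of members would be linked in the similarity graph when `ruzicka(...)` exceeds the chosen threshold (0.7 in the talk).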
So it looks like there may be a few members that are spuriously linked to the rest of the cluster, that is, they are not that similar to the rest of the cluster, so there may be some noise there. And then if you look at the center bottom of this figure, you’ll see that there are two large clusters connected by one single member. Again, there’s a possibility of a bit of noise here, in that those two clusters may be behaviorally different from one another, and if you link them together, there’s a bit of noise there. So we need a bit of refinement on top of this similarity graph, and what we use is the Jarvis-Patrick algorithm. In the Jarvis-Patrick algorithm, the key idea is using nearest neighbors. For example, I will have my 20 nearest neighbors in terms of members who look like me in how I engage with content, and then James has his own respective list of 20 nearest neighbors. Jarvis-Patrick will take those two sets and compare them, and if some minimal similarity is met, then we’ll say James and Grace go into the same cluster. At the end of the day, what ends up happening is the number of clusters you’ll get will either be the same as the number of connected components, or you’ll get more; basically, some of the connected components get broken down. That is what is shown here on the schematic, where some of the connected components break into two or three clusters. You’ll also notice that there are some members at the edges or fringes that are not part of a cluster at all. So basically, Jarvis-Patrick is going to ensure that your clusters are more homogeneous in terms of the behavior that they exhibit. Jarvis-Patrick was selected for a few key reasons. The first two: it’s fast and deterministic, really useful in those early stages when you’re iterating and piloting a new algorithm. The next one: no need to define the number of clusters.
This is incredibly important in the anti-abuse domain, because on any given day, you don’t know how many bad actors are on your platform, so you don’t know how many clusters you should expect. An algorithm that can naturally figure that out is really important. Then, the clusters don’t overlap, a great property to have when you’re trying to interpret your results. And finally, Jarvis-Patrick accommodates clusters of variable sizes. This is, again, something very important in anti-abuse, because you have bad actors that will come very heavy-handed and spin up lots of fake accounts to run a fake engagement campaign, so that’s a huge cluster. But there are also other bad actors who are trying to stay under the radar and thus will just spin up a few fake accounts, slow and steady, and so those will be small clusters. We don’t want an algorithm biased to one or the other; we want to be able to act on both. And Jarvis-Patrick did live up to these expectations.
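The shared-nearest-neighbor idea behind Jarvis-Patrick can be sketched as follows. This is one common formulation (link two nodes when each is in the other's neighbor list and their lists share enough members, then take connected components of the link graph); the talk does not spell out LinkedIn's exact variant or thresholds, so treat this as illustrative.

```python
from collections import defaultdict

def jarvis_patrick(neighbors, min_shared):
    """neighbors: dict of node -> set of its k nearest neighbors.

    Link two nodes when each appears in the other's neighbor list AND
    their neighbor lists share at least `min_shared` entries; clusters
    are the connected components of the resulting link graph.
    """
    links = defaultdict(set)
    nodes = list(neighbors)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            mutual = b in neighbors[a] and a in neighbors[b]
            if mutual and len(neighbors[a] & neighbors[b]) >= min_shared:
                links[a].add(b)
                links[b].add(a)
    # Connected components of the link graph via iterative DFS.
    clusters, seen = [], set()
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(links[current] - seen)
        clusters.append(component)
    return clusters
```

Note the properties from the talk fall out naturally: the number of clusters is discovered rather than specified, clusters cannot overlap (each node lands in exactly one component), and weakly attached fringe nodes fail the shared-neighbor test and end up on their own.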
And so here are examples of fake account clusters that we detected, and in every single one of these clusters, every member is fake, so the results are a hundred percent clean. The way to read these figures is that the black dots are members, and the colored lines represent the similarity between those members in terms of how they engage with content, with similarity ranging from 0.7, which was that threshold, all the way up to 1.0. In the upper left-hand corner, that’s actually a clique. That’s an example of a really brute-force script, where all of the accounts controlled by that script like the exact same pieces of content. Now, in contrast, if you look at the bottom row in the center, that sort of V shape, that one’s very different. There, the bad actor runs one fake engagement campaign, which corresponds to a high-density patch, and then takes one or two of those accounts to test the next round of fake engagement, and so on and so forth. And that’s why you have this dense patch alternating with sparse patches in that V shape. Now, we would like to think that the results are always going to be all the things that we want, but there are always false positives, and false positives for us are basically clusters that are formed of real accounts. Here are six examples. In the top row, you’ll see employees of a company. What happens here is we have small companies that post content and then ask all their employees to like that content, causing this sort of nonorganic behavior that the algorithm picks up. In the bottom row, these are all cryptocurrency interest groups. Here, these interest groups are posting, and the posts have a call to action saying, “Hey, please like this content and I’ll give you free membership to this other website.” So this call to action causes this nonorganic behavior, and thus the algorithm picks it up.
Now, these are false positives because they’re real accounts, and they’re also behaviorally very different, and not malicious, compared to the clusters on the previous slide.
So what this means for us is that graph clustering, as well as isolation forest, is guaranteed to give us accounts that are behaviorally similar, but these could be fake accounts, or they could be real accounts, such as employees at companies. And so we can’t just take the results and take down all these accounts from the platform, because there’s a mix; there are some false positives. In practice, what this means is that we need to layer on a supervised model or some heuristics so that we can separate the fake accounts from the hacked accounts from the real accounts, the reason being that we have different outcomes or mitigation techniques for each of these populations. For example, for fake accounts, the goal is to get in there and take them down from the platform as soon as you can. But for hacked accounts, it’s almost more of a white-glove service: you want to secure the account and then really help the real member get back into their account. James and I didn’t show any examples for hacked accounts today, but we have seen successes using unsupervised machine learning in-house to detect hacked accounts. And then finally, for real accounts, which we saw a few cases of, the outcome is going to differ depending on what they’re doing. For the example of employees at companies, it’s not really malicious, so there’s not much we need to do there. But if, for example, these are real accounts that are using automation to send a bunch of invitations on the platform, well, that will degrade the health or quality of the network, and so we may want to educate those members into better behavior. And so with that, we’d just like to tie it all back to where we started, which was some of the unique challenges in anti-abuse. In anti-abuse, one of the key problems is that you don’t have labels, and when you don’t have labels, naturally, unsupervised machine learning is a great fit.
But it’s also important to remember that when you apply these techniques, you get these clusters or groups of accounts that are behaviorally similar, and at that point you can send those for labeling, ’cause you have a lot fewer clusters than you do accounts. More importantly, when you’re looking at clusters, you actually have way more signal. As James mentioned in the beginning, sometimes you don’t have enough signal when you’re just looking at one example. For that fake engagement case I talked about, some of the accounts engaged with only five pieces of content, really camouflaged in the broader population, but super obvious once you run graph clustering. And then finally, in the anti-abuse domain, we can never forget that it’s an adversarial relationship that we have with the bad actors. We get more sophisticated, they get more sophisticated, and the game begins all over again. And because we’re not relying on labels in unsupervised learning, that’s great; it keeps us more flexible. But we do have to remember that the bad actor behavior must map to a different space compared to the normal user behavior, and I think we showed today that both isolation forest and graph clustering can achieve that: separating malicious actors from normal users. And so with that, I would just like to thank a few folks, particularly Ahmed and Adam, who really spearheaded and conducted most of this work.
James Verbus is a Staff Machine Learning Engineer on the Anti-Abuse AI Team at LinkedIn. His current focus is the detection and prevention of abusive automated activity. James received his Ph.D. in experimental particle astrophysics from Brown University.
Grace Tang is a Senior Staff Machine Learning Engineer on the Anti-Abuse AI Team at LinkedIn. She works across abuse domains, focusing on integrating detection systems together to achieve defense in depth. Grace received her Ph.D. in Bioengineering from Stanford University.