Ting Chen is currently an Engineering Manager in Trust AI at LinkedIn. Previously she was a Director of Applied Machine Learning at Tencent, and before that a Data Science Manager and Senior Machine Learning Engineer at Uber. Ting holds a PhD in computer science with a research background in computer vision and machine learning. Note that the work presented here was done while Ting was affiliated with Tencent.
A security operation center (SOC) enables an enterprise to monitor and analyze its organization-wide security posture. The SOC platform monitors the invocation behavior of software on devices and generates tens of millions of behavior event sequences daily. In this work, we propose an explainable anomaly detection method using Spark AI that automatically discovers new threat patterns from event sequence data and provides reason codes for security operations. The results are further enhanced through interactive visualization. Because we need to handle tens of millions of records per day, we rely on Spark for large-scale data analytics and modeling. We use an unsupervised anomaly detection algorithm based on Variational Autoencoders (VAE): the model learns latent representations for all event sequences, and a One-Class SVM detects the sequences that deviate from the overall distribution. We also employ a visualization system to help security operators interpret anomalies and their attributes. Finally, we quantitatively evaluate the performance of our anomaly detection model and demonstrate the effectiveness of our system through reports and feedback collected from the SOC platform.
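For readers unfamiliar with the two components named in this abstract, the block below writes out their standard textbook forms. The abstract does not specify the encoder and decoder architecture, the prior, the kernel, or which latent vector is passed to the One-Class SVM, so the symbols here (encoder q_φ, decoder p_θ, prior p(z), kernel k, offset ρ, coefficients α_i) are assumed notation rather than the authors' exact setup.

```latex
% VAE training objective (evidence lower bound) for one event sequence x
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)

% One-Class SVM decision on a latent vector z, given training latents z_i
f(z) = \operatorname{sgn}\Big(\sum_{i} \alpha_i \, k(z_i, z) - \rho\Big),
\qquad f(z) = -1 \;\Rightarrow\; \text{anomaly}
```

Under this reading, the VAE is trained on all sequences by maximizing the first expression, and a sequence is flagged when its latent vector falls outside the region the One-Class SVM learned from the bulk of the data.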
Malicious domains are one of the main resources used to mount attacks over the Internet. It is important to detect such activity by mining large-scale network traffic data and identifying malicious URLs, domains, or IPs. Attackers often take advantage of vulnerabilities in DNS to steal private information, send spam, run phishing campaigns, and launch DDoS attacks, and they tend to bypass botnet detection by generating domain clusters with Domain Generation Algorithms (DGA). We have billions of DNS records per day, so the Spark AI platform serves as an efficient distributed platform for processing and mining this volume of data. We work on the following two cybersecurity use cases.

1. Detect DGA, porn, and gambling domains. Each malware-compromised host machine issues a large number of DNS requests in sequential order. The domain names are either generated by DGAs or follow particular string patterns by design. We use Spark to build per-host sequences of requested domains and use Word2Vec to learn an embedding for each domain. We then compute similarity in the embedding space, and the most similar domains are flagged as potentially malicious.

2. Detect cryptocurrency mining pool domains. Attackers are interested in gaining access to computing resources to mine cryptocurrency, so malware-infected computers are directed to attacker-controlled mining pool domains. These DNS requests do not follow a sequential order and are relatively random. Since each mining pool domain cluster is visited by a wide range of different host machines, we use LSH to evaluate the similarity among the sets of hosts that visit each domain. LSH yields a domain-bucket bipartite graph, and the FastUnfolding algorithm is then used to discover the domain clusters.

We leveraged Spark AI for large-scale DNS data analysis and discovered hundreds of thousands of malicious domains each day at high precision.
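The second use case (comparing the sets of hosts that visit each domain) can be illustrated with a minimal single-machine sketch. It is not the authors' Spark pipeline: it assumes MinHash with banding as the LSH scheme, and it swaps in plain connected components over shared buckets where the abstract uses FastUnfolding community detection. All names, the example domains, and the parameters `NUM_HASHES` and `BANDS` are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32              # signature length (assumed value)
BANDS = 8                    # number of LSH bands (assumed value)
ROWS = NUM_HASHES // BANDS   # signature rows per band


def _hash(seed: int, value: str) -> int:
    """Deterministic 64-bit hash of `value`, varied by `seed`."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(hosts):
    """MinHash signature of a non-empty set of host identifiers."""
    return tuple(min(_hash(seed, h) for h in hosts) for seed in range(NUM_HASHES))


def lsh_buckets(domain_hosts):
    """Build the domain-bucket bipartite graph as {bucket: set of domains}."""
    buckets = defaultdict(set)
    for domain, hosts in domain_hosts.items():
        sig = minhash_signature(hosts)
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            buckets[(b, band)].add(domain)
    return buckets


def cluster_domains(domain_hosts):
    """Group domains that share a bucket (connected components via union-find).

    Stand-in for FastUnfolding: simpler, but it merges any two domains that
    are linked through a chain of shared buckets.
    """
    parent = {d: d for d in domain_hosts}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for members in lsh_buckets(domain_hosts).values():
        first, *rest = sorted(members)
        for other in rest:
            parent[find(other)] = find(first)

    clusters = defaultdict(set)
    for d in domain_hosts:
        clusters[find(d)].add(d)
    return [c for c in clusters.values() if len(c) > 1]


# Hypothetical input: two pool domains visited by the same infected hosts,
# plus an unrelated domain with a disjoint set of visitors.
infected = {f"host-{i}" for i in range(50)}
example = {
    "pool-a.example": set(infected),
    "pool-b.example": set(infected),
    "news.example": {f"other-{i}" for i in range(50)},
}
print(cluster_domains(example))
```

Domains whose host sets have high Jaccard similarity are likely to agree on at least one band and therefore land in a common bucket. Raising `ROWS` makes matches stricter, while raising `BANDS` makes them more permissive. At the scale described in the abstract, the same idea would run as a distributed job rather than in a single process.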