Large-Scale Malicious Domain Detection with Spark AI - Databricks

Malicious domains are one of the main resources used to mount attacks over the Internet. Detecting them requires mining large-scale network traffic data to identify malicious URLs, domains, and IPs. Attackers often take advantage of vulnerabilities in DNS to steal private information, send spam, run phishing campaigns, and launch DDoS attacks, and they tend to bypass botnet detection by generating clusters of domains with Domain Generation Algorithms (DGAs). We process billions of DNS records per day, so the Spark AI platform serves as an efficient distributed platform for processing and mining this volume of data.

We work on the following two cybersecurity use cases.
1. Detect DGA, porn, and gambling domains. Each malware-compromised host machine issues a large number of DNS requests in sequential order. The domain names are either generated by DGAs or preserve particular string patterns by design. We use Spark to build per-host sequences of requested domains and Word2Vec to learn an embedding for each domain. We then compute similarities in the embedding space, and the most similar domains are flagged as potentially malicious.
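The sequence-and-similarity idea in the first use case can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the speakers' pipeline: the talk uses Spark and Word2Vec, whereas this sketch swaps in simple co-occurrence counts as a stand-in for learned embeddings so that it runs with the standard library alone. The record layout `(host, timestamp, domain)`, the function names, the window size, and the idea of ranking against a known seed domain are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_sequences(records):
    """Group (host, timestamp, domain) records into per-host domain
    sequences ordered by timestamp (the step Spark would do at scale)."""
    by_host = defaultdict(list)
    for host, ts, domain in records:
        by_host[host].append((ts, domain))
    return {host: [d for _, d in sorted(reqs)] for host, reqs in by_host.items()}


def cooccurrence_vectors(sequences, window=2):
    """Stand-in for Word2Vec: represent each domain by counts of the
    domains seen within `window` positions of it in any host sequence."""
    vectors = defaultdict(Counter)
    for seq in sequences.values():
        for i, center in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[center][seq[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(vectors, seed, top_n=3):
    """Rank all other domains by similarity to a (hypothetical) seed domain."""
    seed_vec = vectors.get(seed, Counter())
    scores = [(cosine(seed_vec, vec), dom)
              for dom, vec in vectors.items() if dom != seed]
    return [dom for _, dom in sorted(scores, reverse=True)[:top_n]]


# Made-up records: two infected hosts query the same random-looking names.
records = [
    ("h1", 1, "xkqzjw.example"), ("h1", 2, "pqlmzt.example"), ("h1", 3, "wvbnrk.example"),
    ("h2", 1, "pqlmzt.example"), ("h2", 2, "xkqzjw.example"), ("h2", 3, "wvbnrk.example"),
    ("h3", 1, "news.example"), ("h3", 2, "mail.example"), ("h3", 3, "shop.example"),
]
vectors = cooccurrence_vectors(build_sequences(records))
print(most_similar(vectors, "xkqzjw.example", top_n=2))
```

In a Spark setting the grouping step would typically be a group-by on host followed by a sort on timestamp, and the co-occurrence vectors would be replaced by a trained Word2Vec model; the ranking by cosine similarity stays the same in spirit.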

2. Detect cryptocurrency mining pool domains. Attackers want access to computing resources to mine cryptocurrency, so malware-infected computers are directed to attacker-controlled mining pool domains. DNS requests of this type do not preserve sequential order and are relatively random. Since each mining pool domain cluster is visited by a wide range of different host machines, we use locality-sensitive hashing (LSH) to evaluate the similarity among the sets of hosts. LSH produces a domain-bucket bipartite graph, and the Fast Unfolding algorithm is then used to discover the domain clusters.

We leveraged Spark AI for large-scale DNS data analysis and discovered hundreds of thousands of malicious domains each day at high precision.
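The host-set similarity step of the second use case can be illustrated with a small MinHash-based LSH sketch in plain Python. It is a sketch under assumptions rather than the speakers' implementation: the talk does not say which LSH family is used, so MinHash with banding is assumed here because it targets set (Jaccard) similarity. Connected components over the domain-bucket graph are used as a simplified stand-in for Fast Unfolding, which is a modularity-based community detection method. All function names and parameters are hypothetical, and every host set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict


def _hash(seed, value):
    """Deterministic 64-bit hash of `value`, salted by `seed`."""
    digest = hashlib.sha1(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(hosts, num_hashes=16):
    """MinHash signature of a non-empty set of host identifiers."""
    return [min(_hash(seed, h) for h in hosts) for seed in range(num_hashes)]


def lsh_bipartite_edges(domain_hosts, num_hashes=16, bands=4):
    """Split each signature into bands; every band value is a bucket.
    Returns (domain, bucket) edges of the domain-bucket bipartite graph."""
    rows = num_hashes // bands
    edges = []
    for domain, hosts in domain_hosts.items():
        sig = minhash_signature(hosts, num_hashes)
        for b in range(bands):
            bucket = (b, tuple(sig[b * rows:(b + 1) * rows]))
            edges.append((domain, bucket))
    return edges


def cluster_domains(edges):
    """Connected components via union-find: domains that share any
    bucket end up in one cluster (simplified stand-in for Fast Unfolding)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for domain, bucket in edges:
        parent[find(domain)] = find(bucket)

    clusters = defaultdict(set)
    for domain, _ in edges:
        clusters[find(domain)].add(domain)
    return list(clusters.values())


# Made-up data: two pool domains visited by the same hosts, one unrelated site.
domain_hosts = {
    "pool-a.example": {"h1", "h2", "h3"},
    "pool-b.example": {"h1", "h2", "h3"},
    "blog.example": {"h9"},
}
for cluster in cluster_domains(lsh_bipartite_edges(domain_hosts)):
    print(sorted(cluster))
```

More bands with fewer rows per band make it easier for partially overlapping host sets to share a bucket, at the cost of more false matches; a modularity-based method such as Fast Unfolding would then split loosely connected groups that plain connected components would merge.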


See More Spark + AI Summit in San Francisco 2019 Videos

About Ting Chen

Ting Chen is a Director of Applied Machine Learning at Tencent, where she helps the team apply big data and machine learning techniques to solve high-impact business problems in security and healthcare and build a scalable ML engineering platform. Previously, she was a Data Science Manager working on Risk Management and a Senior Machine Learning Engineer on Mobile Platform at Uber Technologies Inc. She also worked as a research scientist at Huami, Roche Diagnostics, and IBM Research on healthcare-related problems. Ting holds a PhD in computer science with a research background in computer vision and machine learning.

About Hao Guo

Hao Guo works as an Applied Research Scientist at Tencent. His main job is to analyze and mine security data using machine learning and big data techniques for Tencent Security Brain, which provides security services to customers. He has a Master's degree in Computer Science and has published several research papers in NLP and holds patents in security. His interests are deep learning and large-scale machine learning, with a focus on applications in security.