Ping Yan - Databricks

Ping Yan

Research Scientist,

Ping spent a decade finding new ways of making sense of data in various domains, from consumer behavior modeling to algorithmic security threat detection. Her work has been published in journal articles, monographs, and books. Ping holds a Ph.D. in Management Information Systems from the University of Arizona with a focus on machine learning and AI. She is currently a Research Scientist with the Salesforce Security Analytics team. Ping has spoken at various data science and InfoSec conferences, including ICIS, WITS, CanSecWest 2013, OWASP AppSec 2015, and Spark Summit 2016.


A Graph-Based Method For Cross-Entity Threat Detection

I propose to use a distributed graph-based approach to detect cross-entity attacks by correlating global events on multi-tenant platforms. Detection efforts have mostly focused on detecting each incident individually, yet in most attack scenarios a single attacker or attacker group goes after multiple targets, often via stolen credentials, within a rather concentrated time window. Coordinated or concurrent attacks seriously damage trust in a multi-tenant service platform provider when customers are infiltrated on its platform. How can we detect these cross-account attacks by quickly making connections across concurrent incidents? Connections are often buried under terabytes of data and among tens of millions of legitimate connections. Only a complete graph with a proper level of abstraction of all information, combined with smart algorithms, provides a viable solution. By representing all entities of interest (e.g., an organization or an IP address) in a graph, we can efficiently track the connectivity among these entities. This lets us differentiate unexpected connections that are indicative of cross-account attacks from legitimate cross-account relationships (for example, two accounts that belong to the same customer) by identifying correlated threats. A change detection algorithm is proposed to identify unexpected connectivity among accounts in the graph. For example, when we detect suspicious behavior across multiple accounts, how do we know whether this is a large-scale account takeover or just a legitimate license upgrade that results in novel behavior across multiple users? If the affected accounts are already densely connected, a suspicious concurrent behavior is not as interesting as one detected among highly disjoint accounts. A graph provides a holistic view of how components are connected, and a graph-based solution is essential to security defense techniques. It also gives us a number of opportunities beyond cross-account attack detection, such as intuitive context retrieval and interactive visualization.
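
The "densely connected versus highly disjoint" intuition can be illustrated with a small sketch. This is not the change detection algorithm from the talk: it only measures what fraction of the flagged account pairs are already linked by a known legitimate relationship. The function names, the account identifiers, and the `max_density` threshold are all hypothetical.

```python
from itertools import combinations


def induced_density(edges, flagged):
    """Fraction of flagged-account pairs already linked by a known relationship.

    `edges` is an iterable of (account, account) pairs describing legitimate
    relationships; `flagged` is the set of accounts showing concurrent
    suspicious behavior. Both are illustrative inputs.
    """
    flagged = set(flagged)
    if len(flagged) < 2:
        raise ValueError("need at least two flagged accounts")
    linked = {frozenset(edge) for edge in edges}
    pairs = list(combinations(sorted(flagged), 2))
    connected = sum(1 for pair in pairs if frozenset(pair) in linked)
    return connected / len(pairs)


def is_unexpected(edges, flagged, max_density=0.2):
    """Flag concurrent behavior among mostly unrelated accounts.

    The 0.2 cutoff is an arbitrary placeholder, not a value from the talk.
    """
    if len(set(flagged)) < 2:
        return False  # a single account is not a cross-account event
    return induced_density(edges, flagged) <= max_density


# Hypothetical relationship graph: three accounts of one customer, plus one
# unrelated pair of accounts.
known_links = [
    ("acme-1", "acme-2"),
    ("acme-2", "acme-3"),
    ("acme-1", "acme-3"),
    ("globex", "initech"),
]

# Same-customer accounts changing together: likely benign.
print(is_unexpected(known_links, ["acme-1", "acme-2", "acme-3"]))
# Unrelated accounts changing together: worth a closer look.
print(is_unexpected(known_links, ["acme-1", "globex", "umbrella"]))
```

A production system would compute this over a distributed graph and compare connectivity over time rather than against a fixed threshold.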

Needle in the Haystack—User Behavior Anomaly Detection for Information Security

Salesforce recently invented and deployed a real-time, scalable, low-false-positive personalized anomaly detection system that operates at terabyte scale. Anomaly detection on user in-app behavior at terabyte scale is extremely challenging because traditional techniques such as clustering methods suffer serious performance issues in production. Salesforce's method tackles these challenges in three phases: 1) Principal Component Analysis (PCA) is used to extract high-variance and low-variance feature subsets. The low-variance feature subset is valuable in cybersecurity because we want to determine whether a user deviates from his or her stable behavior; the high-variance subset is used for dimension reduction. 2) On each feature subset, a profile is built for each user to characterize the user’s baseline behavior and legitimate abnormal behavior. 3) During detection, each incoming event is compared with the user’s profile to produce an anomaly score. The computational complexity of the detection module for each incoming event is constant. [...]st cloud computing platforms; the novelty of our anomaly detection technique based on user behavior profiling; and the challenges of implementing and deploying it with Apache Spark in production. We will also demonstrate how our system outperforms traditional machine learning algorithms. Session Hashtag: #SFml5
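
Phases 2 and 3 of the abstract above (a per-user profile, then constant-cost scoring of each incoming event) can be sketched as follows. This is only an illustration under stated assumptions, not Salesforce's implementation: the PCA-based feature selection is assumed to have happened upstream, the profile is reduced to a running mean and variance per feature, and a maximum absolute z-score stands in for the real anomaly score. The `UserProfile` class and the sample events are hypothetical.

```python
import math


class UserProfile:
    """Running per-feature mean and variance for one user (Welford's method).

    A stand-in for the per-user baseline profile; the real profile also
    models legitimate abnormal behavior, which this sketch does not.
    """

    def __init__(self, n_features):
        self.n = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations

    def update(self, event):
        """Fold one historical event (a sequence of feature values) into the profile."""
        self.n += 1
        for i, x in enumerate(event):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def score(self, event, eps=1e-9):
        """Largest absolute z-score across features.

        The cost depends only on the number of features, not on how many
        events were used to build the profile.
        """
        if self.n < 2:
            return 0.0  # not enough history to judge
        worst = 0.0
        for i, x in enumerate(event):
            std = math.sqrt(self.m2[i] / (self.n - 1))
            worst = max(worst, abs(x - self.mean[i]) / (std + eps))
        return worst


# Hypothetical history for one user: (records viewed, exports) per session.
profile = UserProfile(n_features=2)
for past_event in [(10, 1), (12, 1), (11, 2), (9, 1), (10, 2)]:
    profile.update(past_event)

print(profile.score((10, 1)))  # close to the baseline, low score
print(profile.score((60, 1)))  # far from the baseline, high score
```

Because the profile stores only summary statistics, scoring an event never revisits the user's history, which is what keeps the per-event detection cost constant.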