Learning to hash has been widely adopted as a solution to approximate nearest neighbor search for large-scale data retrieval in many applications. Applying deep architectures to learning to hash has recently gained increasing attention due to its computational efficiency and retrieval quality. However, existing deep architectures are not well suited to handling ‘sequential behavior data’, a type of data observed in many application scenarios related to user modeling. We believe that in order to learn binary hashing for sequential behavior data, it is important to capture the user’s evolving preference or exploit the user’s activity patterns at different time scales. In this work, we propose a deep learning-based architecture to learn binary hashing for sequential behavior data. The proposed framework utilizes the Spark platform for large-scale data preprocessing, modeling and inference. We also describe how the distributed inference job is performed on Databricks with Pandas UDF.
– Good afternoon everyone. Our topic today is user behavior hashing for audience expansion.
And we are from Samsung.
Our brief agenda today. This is Praveen, and I’m director of engineering at Samsung Research America. I’m going to be joined by my colleague and co-presenter Yingnan, who is our lead data scientist at Samsung Research America. Together, we are going to cover our topic, user behavior hashing for audience expansion. First I’ll provide a high-level overview of Samsung and what we do. Next I’ll provide an architecture overview of the Samsung audience platform, and I will introduce you to lookalike modeling. Then Yingnan will cover the rest of the topics with a deep dive into how we accomplish lookalike modeling using hashing techniques. Then we provide benchmarks and model performance, and then we can go through some Q&A.
So we are Samsung. We are a global company with more than 300,000 diverse employees, and we have global revenues to the tune of 220 billion dollars. More importantly, we operate at global scale, with 35 R&D centers across the globe. Our group specifically represents the global research group, Samsung Research America, and we are going to take a little bit of a deep dive into that.
For Samsung Research, these are some of our core areas of research. They start with artificial intelligence, where we focus on the hardware as well as the software side of AI. That is followed by data intelligence; given we are Samsung, we have huge data sets. We also focus on 5G and 6G, all the technological advancements related to mobile. Then we also focus on robotics, and on Tizen, which is our operating system for TVs. And last but not least, we also focus on next-generation display and media. Our specific example today is a deeper dive into next-generation display and media.
Because we talked about the whole lot of data that we at Samsung handle on a day-to-day basis, having a robust audience platform is absolutely necessary to handle these huge amounts of data. With our audience intelligence platform, we focus on some of these core areas. We work on recommendations, we work on user modeling techniques, and we also work on multi-modal techniques related to voice and vision. All of these are powered by our AI experience. As part of our audience platform, we work with both first-party and third-party data sources. First-party data includes the data that we collect from our consumer electronics devices, which include TVs, mobile phones, IoT devices, etc. Third-party data is obtained from TV networks, ads, third-party device graphs that we have, and so forth. What we do is bring our first-party and third-party data together into our data platform. We proceed by first ingesting the data, using our batch as well as real-time data processing. Then we store it; because the amount of data is so huge, we store several months of data here. Once the data is ingested, we have our machine learning and deep learning platform as part of our ML & DL practices. These are some of the algorithms, approaches or problems that we solve on a daily basis: lookalike modeling, and problems related to recommendations and personalization. Then there are problems related to optimization and attribution; those two especially are problems in the advertising space. And last but not least, there are problems related to fraud detection, and we also use natural language processing in order to process speech, voice and all of those. What is also supported by our ML & DL platform is a model management framework, and also our experimentation framework. We run a bunch of A/B tests on a daily basis.
On top of this AI and ML platform, we have the data visualization as well as the APIs. Now, when you look at the right side of the picture, it is primarily the business applications. Through our platform, we support these core businesses. Samsung Ads has been one of the tremendous growth drivers, so this is the platform that supports it alongside Samsung marketing. We also have a lot of recommendations and personalization enabled on our TVs for our consumers, and many use cases related to multi-modal and IoT. Those are the core business applications that are supported as part of this platform. Now, if you focus on the bottom of the screen, there are five different stages. First, we start with audience discovery; this is where we ingest all of our data from all the devices, the user interactions and whatnot. The next phase is to do the high-level segmentation and understand where exactly these audiences are coming from. Think about demographics, location, etc.
It is followed by audience expansion, which is what we are going to focus on for the rest of the session. Once you expand the audience, you are then able to drill down into how you want to tune your audience to focus on and target specific campaigns. And last but not least, how well your targeting works is measured by audience measurement techniques, especially attribution.
Let’s talk about lookalike modeling in the Samsung context.
As part of lookalike modeling, we have two different goals that we want to cover today. The first goal is how we can improve our incremental reach. The second goal is how we can improve our targeting. We are talking from the perspective of two different use cases: the first use case is TV networks, and the second use case is new Samsung TV purchases. So, let’s dig deep into TV networks. What I mean here is, especially when it comes to new shows that air on different TV channels during fall premieres and so forth, how can you identify new audiences that replicate some of the behaviors of your existing audiences? So this is where you think of, you know what?
I have context on certain types of audiences; how can I make use of that user context and expand it to potentially identify new audiences? Now, the approach or methodology is similar for new TV purchases as well: considering the existing TV universe, understand who is already an existing 8K or 4K TV owner, understand what type of user behavior they exhibit, and work out how you could make use of it to find out who your potential new TV purchasers are. So these are the two main goals that we can potentially solve by using lookalike modeling techniques. When it comes to our approach: as part of Samsung, we have ACR viewership data, which we have from 50-plus million TVs in the US. By applying user behavior hashing techniques, we would be able to identify those TV viewers that are similar to existing audiences based on user behavior.
So let’s look at our lookalike audience expansion example. On the graphic on the left, imagine the full circle is our entire TV universe. Here we are specifically talking about how I can find those audiences that are similar to my existing TV owners.
As you notice, the small circle on the left is A. A is my seed audience, and now I want to find those audiences that are similar to that seed audience A. This is what is highlighted as the dotted circle. If we apply our hashing technique to all of this, we would be able to determine the circle B, which is an expanded segment of A.
So using this, given a seed audience A, you would be able to identify the expanded set B. To go into more detail on how the hashing technique itself works, my co-presenter Yingnan will take it from here.
– Thank you Praveen. It is my great pleasure to talk about one of our previous works related to audience expansion using lookalike technologies.
In this particular task, we are trying to find a ranked group of similar users. As you know, Samsung has a large-scale user base, including Samsung Smart TVs and mobile devices, and the scale of the user interaction data is huge.
All this user interaction data is related to things like content consumption, video-on-demand consumption, gameplay, application usage and even interaction with external devices. Because of the large scale of these data sets, it is pretty important for us to derive an efficient algorithm to find the similar users.
Although this task can be run offline, we still want to limit the resources spent on this particular task. There has been a lot of work in the industry on solving a similar problem at large scale, K-nearest-neighbor search. One typical approach is called locality-sensitive hashing (LSH). There are also techniques related to finding similar users in recommender systems. However, these types of approaches sometimes don’t capture changes in the user’s behavior efficiently. They also cannot capture the contextual change when the user interacts with the devices. So to solve those problems, we have to define a very efficient hashing method that can capture all the contextual information and at the same time preserve the similarity between users, so that we can generate a bucketized user search space in which we can search and find the most similar users in a timely manner.
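For readers unfamiliar with LSH, here is a minimal sketch of one common variant, random-hyperplane hashing. It is not the method proposed in this talk; the dimensions, the random vectors and all names are illustrative assumptions.

```python
import numpy as np

def lsh_signature(vectors, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return (vectors @ hyperplanes.T >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
dim, n_bits = 16, 8                                 # illustrative sizes
hyperplanes = rng.standard_normal((n_bits, dim))    # random projection directions

user_a = rng.standard_normal(dim)
user_b = user_a + 0.01 * rng.standard_normal(dim)   # near-duplicate of user_a
user_c = -user_a                                    # points the opposite way

sig = lsh_signature(np.stack([user_a, user_b, user_c]), hyperplanes)
hamming_ab = int(np.sum(sig[0] != sig[1]))          # few or no differing bits expected
hamming_ac = int(np.sum(sig[0] != sig[2]))          # every bit flips for the opposite vector
```

The hyperplanes here are random and fixed, so the codes know nothing about behavior over time or context, which is the limitation described above.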
So this is the high-level workflow for the particular use case we mentioned. First of all, we collect data from Samsung first-party plus some third-party sources. Then we do some preprocessing on the user behavior data, and then we can feed it into the deep binary hashing model that we’ll talk about.
After running through this deep hashing model, we generate a heterogeneous hash code for each user. This hash code can be used efficiently in fast search. For the lookalike use case in particular, we can take the seed segments of users, and by bucketizing the user hash codes we can implement a fast search algorithm that very efficiently finds similar users, so that we can expand the audience even online, in a real-time manner. The training process and the hash code generation process, however, we run offline.
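A minimal sketch of how bucketized hash codes can support this kind of seed expansion. The bucketing rule (shared prefix bits), the Hamming threshold and all names are assumptions made for illustration; the talk does not describe the production search algorithm.

```python
from collections import defaultdict
import numpy as np

def pack_code(bits):
    """Pack a 0/1 array into one Python int so it can serve as a dict key."""
    value = 0
    for b in bits:
        value = (value << 1) | int(b)
    return value

def build_buckets(user_codes, prefix_bits):
    """Group user ids by the first `prefix_bits` bits of their hash code."""
    buckets = defaultdict(list)
    for user_id, bits in user_codes.items():
        buckets[pack_code(bits[:prefix_bits])].append(user_id)
    return buckets

def expand_audience(seed_ids, user_codes, buckets, prefix_bits, max_hamming):
    """Non-seed users sharing a bucket with a seed and within max_hamming bits of it."""
    seeds = set(seed_ids)
    best = {}
    for seed in seeds:
        seed_bits = user_codes[seed]
        for cand in buckets[pack_code(seed_bits[:prefix_bits])]:
            if cand in seeds:
                continue
            dist = int(np.sum(seed_bits != user_codes[cand]))
            if dist <= max_hamming:
                best[cand] = min(dist, best.get(cand, dist))
    return sorted(best, key=lambda u: (best[u], u))   # closest users first

# Hypothetical 8-bit codes for four users
user_codes = {
    "u1": np.array([1, 0, 1, 1, 0, 0, 1, 0]),
    "u2": np.array([1, 0, 1, 1, 0, 0, 1, 1]),   # 1 bit away from u1
    "u3": np.array([1, 0, 1, 1, 1, 1, 0, 0]),   # 3 bits away from u1
    "u4": np.array([0, 1, 0, 0, 1, 1, 0, 1]),   # lands in a different bucket
}
buckets = build_buckets(user_codes, prefix_bits=4)
expanded = expand_audience(["u1"], user_codes, buckets, prefix_bits=4, max_hamming=2)
```

Only users in the seed's bucket are compared, which is what keeps the search cheap; the trade-off is that a similar user whose prefix bits differ is missed unless neighboring buckets are probed as well.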
This slide shows the high-level training flow that we use in the hashing process. The training is based on user pairs. By utilizing external knowledge or a predefined user similarity, we generate an input block of two users, and this input goes through our network layers. These network layers will be explained in the next few slides. After we learn the representations from these network layers, there is a hashing layer that specifically generates the heterogeneous hash code for each user.
After the sign function, we mostly use a sigmoid here, and then we can predict whether the users are similar or not.
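A minimal sketch of the forward pass for one user pair, assuming a linear hashing layer, a sigmoid over the inner product of the two codes, and a binary cross-entropy loss. The talk does not spell out these details, so the layer shapes, the scoring function and the loss are all illustrative assumptions.

```python
import numpy as np

def hashing_layer(representation, weights):
    """Project a learned user representation to n_bits real-valued activations."""
    return representation @ weights

def binarize(activations):
    """Sign function: map each activation to +1 or -1."""
    return np.where(activations >= 0, 1, -1)

def similarity_probability(code_a, code_b):
    """Sigmoid of the inner product of two +/-1 codes, a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-float(np.dot(code_a, code_b))))

def pair_loss(prob, label):
    """Binary cross-entropy for one pair (label 1 = similar, 0 = dissimilar)."""
    eps = 1e-12
    return -(label * np.log(prob + eps) + (1 - label) * np.log(1 - prob + eps))

rng = np.random.default_rng(0)
rep_dim, n_bits = 6, 8                               # illustrative sizes
weights = rng.standard_normal((rep_dim, n_bits))     # stand-in for trained weights

user_a = rng.standard_normal(rep_dim)
user_b = user_a.copy()                               # identical behavior
user_c = -user_a                                     # opposite behavior

code_a, code_b, code_c = (binarize(hashing_layer(u, weights))
                          for u in (user_a, user_b, user_c))
p_similar = similarity_probability(code_a, code_b)      # close to 1
p_dissimilar = similarity_probability(code_a, code_c)   # close to 0
```

The sign function has zero gradient almost everywhere, so training code usually substitutes a smooth relaxation for it; which relaxation was used here is not stated in the talk.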
To solve this particular problem, we tried different network architectures. The workflow we talked about is not specific to just one deep network architecture; you can try different architectures once you decide they fit into the workflow. Today I’m going to talk about two of them. The first one we call the time-aware attention CNN model (TAACNN).
In this model, we have basically four layers. The input layer is the data preprocessing layer. It maps the sequential behavior data into a 3D structure that can be processed by a convolutional neural network. Because the behavior data is represented by user interactions with items, the first step in the preprocessing is to convert each item into a vector representation. The second step is to sessionize the user’s history into different time units. For each session, we aggregate all the items that the user has interacted with, using the embeddings of the items generated in the first step.
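A minimal sketch of this preprocessing step, assuming a day of the week as the short time unit, a week as the longer one, and averaging as the aggregation. These choices, the toy two-dimensional item vectors and the function name are illustrative assumptions, not the values used in the talk.

```python
import numpy as np

def build_behavior_tensor(events, item_vectors, n_weeks, emb_dim):
    """Average item embeddings per (day-of-week, week) cell.

    events: iterable of (day_index, item_id), with day_index counted from 0.
    Returns an array of shape (7, n_weeks, emb_dim).
    """
    sums = np.zeros((7, n_weeks, emb_dim))
    counts = np.zeros((7, n_weeks, 1))
    for day_index, item_id in events:
        week, day = divmod(day_index, 7)
        if week >= n_weeks:
            continue                      # outside the history window
        sums[day, week] += item_vectors[item_id]
        counts[day, week] += 1
    return sums / np.maximum(counts, 1)   # cells with no activity stay zero

# Hypothetical item embeddings and one user's interaction history
item_vectors = {"news": np.array([1.0, 0.0]), "sports": np.array([0.0, 1.0])}
events = [(0, "news"), (0, "sports"), (8, "sports")]

tensor = build_behavior_tensor(events, item_vectors, n_weeks=2, emb_dim=2)
```

The result has the same layout as an image with `emb_dim` channels, which is what lets a convolutional network consume it.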
And you can look at the 3D image down there.
The h-axis is the short-term time unit we defined. The w-axis is the mid-term to long-term time unit of the user’s behavior. The d-axis represents the embeddings of the different items the user interacted with. The embedding at this stage usually carries more conceptual information than the similarity information that we want to preserve. This would affect the overall performance of the TAACNN model, especially its ability to preserve the similarity information.
To overcome this, we introduced the embedding layer as part of the model. This embedding layer applies a specifically designed convolution kernel so that it can transform the previous layer’s output into an adaptive distributed representation. The next layer is the time-aware attention layer. This layer is used to extract time-aware attention features in the model. It separates the attention features into short-term, mid-term and long-term. Short-term features are feature abstractions that emphasize a smaller time scale, maybe a day or a week; long-term features capture a longer time range, maybe a month or a season. In this way we try to capture the user’s recent activities and long-term preferences at the same time. The last layer is the aggregation layer, where all features from the previous layers are flattened and concatenated together.
This particular slide gives the description of each of the layers we mentioned previously.
That is the TAACNN model.
We also introduced another deep neural network, which is called the categorical attention model. Because users’ content search behavior is mostly related to the genre information of content or the type information of applications, this categorical attention model tries to capture the user’s different preferences for different genres of content. To efficiently learn the correct user representation from sequential behavior, we wanted to build a category-aware attention network. The proposed model applies an attention mechanism to the user behavior history grouped by item category information. This attention network can discover important items that are useful for representing users with the appropriate preferences. The reason we chose attention over other networks such as LSTM, RNN or GRU is attention’s superior performance.
Our network is also composed of four layers. The first one is the sparse input layer, which captures the items in the user’s interaction history and groups them by category.
The embedding layer then learns the item embeddings, and the attention layer computes the weighted sum over the item embeddings per category. Finally, we combine all of these on top of this metadata attention layer. The output of this particular layer represents the user’s embedding. As a result, the final user embedding contains the user’s preferences, with a good reflection of the long-term, historical behavior patterns.
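A minimal sketch of the per-category weighted sum, assuming dot-product attention against a single query vector and concatenation across categories. The scoring function, the combination step, the toy embeddings and all names are illustrative assumptions; in a real model the embeddings and the query would be learned.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def category_attention(item_embeddings, query):
    """Attention-weighted sum of the item embeddings in one category."""
    weights = softmax(item_embeddings @ query)
    return weights @ item_embeddings, weights

def user_embedding(items_by_category, query):
    """Concatenate the attended vectors of all categories, in sorted category order."""
    parts = [category_attention(items_by_category[c], query)[0]
             for c in sorted(items_by_category)]
    return np.concatenate(parts)

# Hypothetical 2-D item embeddings for one user's history, grouped by genre
items_by_category = {
    "drama": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "sports": np.array([[2.0, 2.0]]),
}
query = np.array([1.0, 0.0])          # stand-in for a learned attention query

drama_vec, drama_weights = category_attention(items_by_category["drama"], query)
embedding = user_embedding(items_by_category, query)
```

Grouping by category before attending keeps a heavily watched genre from drowning out the items of a lightly watched one.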
Because of the large scale and the nature of the data we process, although this similarity-preserving hashing can be calculated offline, we want to limit the time each update iteration takes, so that we can update the users’ hash codes more frequently and capture their recent behavior.
At the beginning, when we ran the inference, we used a Python UDF.
Clearly that is a very inefficient methodology, because it processes data in a row-at-a-time manner. The good thing is that Spark introduced the pandas UDF about three releases back, and when we were doing this particular work, we tried to utilize the pandas UDF grouped map functionality. The pandas UDF is one of the biggest performance boosters in Spark; it can perform more than 100 times faster than the traditional Python UDF by utilizing Apache Arrow to exchange data very efficiently between the Java virtual machine and the Python processes.
This next slide shows the code snippet for how you run the pandas UDF in a grouped map manner.
As mentioned for the TAACNN model, all this user behavior data was reconstructed into a 3D structure. We treated these as images and fed them into the convolutional neural networks.
We wrote this grouped map function so that, for each particular group, we can apply this pandas UDF function. This made our inference 10 to 20 times faster.
One thing we noticed is that for grouped map data, we have to load the whole group into memory. So we have to carefully control the memory use so that the job can run efficiently through all the large-scale data we need to process without causing any exceptions. Eventually, this pandas UDF improved the speed significantly.
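The slide's code is not part of this transcript. As a rough stand-in, here is a minimal sketch of the grouped map pattern using `applyInPandas`, the Spark 3.0 form of the grouped map pandas UDF. The column names, the `group_id` key and the random projection standing in for the trained model are all assumptions.

```python
import numpy as np
import pandas as pd

N_BITS = 8
FEATURE_COLS = ["f0", "f1", "f2", "f3"]                  # hypothetical feature columns
PROJECTION = np.random.default_rng(0).standard_normal(
    (len(FEATURE_COLS), N_BITS))                         # stand-in for a trained model

def hash_user_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Called once per group; the whole group arrives as one pandas DataFrame."""
    features = pdf[FEATURE_COLS].to_numpy()
    bits = (features @ PROJECTION >= 0).astype(int)      # one batched call, not row by row
    codes = ["".join(map(str, row)) for row in bits]
    return pd.DataFrame({"user_id": pdf["user_id"].to_numpy(), "hash_code": codes})

def run_on_spark(spark_df):
    """Apply hash_user_group to every group of a Spark DataFrame (needs pyspark >= 3.0)."""
    return spark_df.groupBy("group_id").applyInPandas(
        hash_user_group, schema="user_id string, hash_code string")

# Local check of the per-group function, no cluster required
sample = pd.DataFrame({
    "group_id": [0, 0],
    "user_id": ["u1", "u2"],
    "f0": [0.1, -0.3], "f1": [1.2, 0.4], "f2": [-0.7, 0.9], "f3": [0.0, -1.1],
})
result = hash_user_group(sample)
```

Because each group is materialized in memory on a single worker, the choice of `group_id` sets the peak memory per task. One option is to derive it from a hash of the user id modulo a chosen number of groups, so that group sizes stay bounded.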
We also evaluated our two models against well-known hashing and similar-user search approaches.
We measure the performance by accuracy.
When we did this evaluation, we ran through three different data sets.
One is MovieLens, a very popular recommender system data set. Another is the Goodreads data set, which contains users’ book reading logs covering more than 10 years. The third data set we tried is Samsung-internal behavior data. We compared our two approaches with LSH, neural CF and another approach called BSH, which is also a way to generate user hash codes. On all these data sets, our models outperformed the others at different hash code bit lengths.
The reason we ran different hash code bit lengths is that the longer the bit length, the more precisely the code can capture the user’s profile.
In conclusion, in this particular work we designed a novel deep binary hashing architecture, which can utilize different types of deep neural networks.
This architecture helped us generate similarity-preserving user hash codes, and these hash codes can be used efficiently in similar-user search.
Our TAACNN model captures the user’s preferences at different time scales, and tries to capture both the user’s recent activities and long-term activities. The other model, the categorical attention model, utilizes categorical information and also captures the user’s time-sequence behavior. And the most important thing here is that the pandas UDF helped us improve the speed significantly, so that we can update the user behavior hashes much more frequently.
So, I want to say thank you to all the Spark + AI Summit 2020 participants. And last but not least, our team is also hiring. There are many different positions related to data engineering and data science. If you’re interested, please feel free to contact Praveen or myself through our email or LinkedIn profiles.