Text or image classification done using deep neural networks presents us with a unique way to identify each trained image/word via something known as an 'embedding'. Embeddings are fixed-size vectors that are learned during the training process of a neural network, and it is very difficult to make sense of these seemingly random values. However, these embeddings are very powerful and carry a lot of hidden information about the objects they represent. This session will unlock some of the different ways in which embeddings can be visualised and comprehended, both from the aspect of model performance and of the underlying signal these embeddings carry to represent the actual object (text/image). For example, the customer journey online can be translated into these embeddings and used to find the real intent and to differentiate between a potentially interested visitor and a casual visitor. Once decoded, these embeddings become more friendly and can be plugged in at a number of places such as comparison, classification, and retraining of the model itself. This session will cover how to unlock the real power of these embeddings using different tools.
– Hey, hello everyone. Thank you for joining this session. I'm really excited to be presenting today. Today, I'm going to talk about the Power of Visualizing Embeddings.
I know some of you might have already worked with embeddings, and for some of you it might be a very new topic. But the overall idea that I want to convey today is to share how embeddings have been super powerful in some of my work and some of my colleagues' work as well. With that thought, let's proceed with the presentation.
A little brief about myself: I'm currently playing the role of Team Lead – Data Science at Bain and Company, as part of their advanced analytics group. Alright, so, today's agenda. I want to start off with the inspiration behind this session and the reason for putting all these slides together. I want to take you on a small journey that showed me the importance of embeddings in the machine learning field. I was going through multiple stages in order to make my decision to buy a car. And what I realized was that it's not just me: every customer who plans to buy something, and it need not be a very big thing, even if it's a TV or just a mobile phone, generally spends a good amount of time doing their own research. And in this whole phase of research, we gradually mature towards our decision about buying that product or not. After finding the relevant information, we either decide to go ahead and buy that product, or we drop out. And it's not just for a car or any other product; it's human nature that before you make a decision, there's a process or a rationale behind it. And that's what we want to leverage using embeddings in today's session. So, in the process of finding out key information about that particular car, what I ended up doing was leaving a set of digital footprints across those different auto portals. What they could garner from my activities on the website was the different page categories that I visited. I was looking into finance-related information for the car, exterior information, some of the key points on the interiors, and the key differentiators in terms of the maintenance or the dealership of the car itself. And during this entire process, you leave a set of information points for the car businesses who want to use this information for their own purposes.
So this is a simple way of representing a user journey in layman's terms. But we can certainly build on this; we can take this to an even denser representation. We can dig into this information further and include the different page categories themselves that I have visited. So, we can see how many times I visited finance-related pages, specification-related pages, dealer-specific pages, and review-related pages. Not only the frequency counts, but also how much time I spent on each of these specific page categories. And finally, whether I converted or not. So you can see the difference: at the initial level it was more of a basic representation, but we have now advanced it and are providing more information points about the user. And it's not just specific to the activity on the website or your clickstream data; you can add additional information points to the user profile as well. If you have demographic information, which region the customer belongs to, some financial information such as what the customer's salary is, their preferences, or the age of the customer, all of those can add more richness to your overall representation of customers.
So, this is an advanced level of representation for a customer.
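To make this concrete, here is a minimal sketch of the kind of conventional representation just described: per-category visit counts, per-category time spent, and a few profile fields concatenated into one vector. All category names and values are made up for illustration.

```python
# Hypothetical example: a conventional (frequency + time + profile)
# representation of a single visitor. Names and numbers are illustrative.
page_categories = ["finance", "specification", "dealer", "reviews"]

visit_counts = {"finance": 3, "specification": 5, "dealer": 1, "reviews": 2}
time_spent   = {"finance": 120, "specification": 340, "dealer": 30, "reviews": 95}  # seconds
profile      = {"age": 34, "region_code": 2, "salary_band": 3}

def build_feature_vector(counts, times, profile, categories):
    """Concatenate per-category counts, per-category dwell times, and profile fields."""
    vec = [counts.get(c, 0) for c in categories]
    vec += [times.get(c, 0) for c in categories]
    vec += [profile["age"], profile["region_code"], profile["salary_band"]]
    return vec

features = build_feature_vector(visit_counts, time_spent, profile, page_categories)
print(features)  # 4 counts + 4 dwell times + 3 profile fields = 11 numbers
```

Note that this vector grows with the number of categories and carries no notion of which categories are related to each other, which is exactly the limitation discussed later.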
Moving on, it's not just a single customer who's visiting a website. At the end of the day, you have millions of customers visiting a particular website. In this example, let's stick to cars. You have a lot of people visiting different sections of the car website, and each of these users then ends up with their own history, their own user journey, which can be represented in a similar manner to what we tried with a single user.
And here you can see, we have multiple users with their own specific information points regarding their user journeys. The point I'm driving towards is that this is a very conventional way of representing a user through the interactions that the user had with the platform.
And it's not just limited to auto portals; wherever there is a customer journey involved, we can use these kinds of representations. If you're looking for a good insurance policy, you might want to do a little bit of research before making your decision. Or it can be an e-commerce or retail journey that you go through in order to buy a product, or a plot or a flat that you're looking for while doing your online research. So in all of these places that involve a customer journey, you can try to represent the sequence of events in different manners.
So what are some of the key questions that businesses can answer with this information? If I'm a business head and I have this information, some of the questions that I would like to answer are: which sets of customer journeys are similar to each other? Since there are millions of customers accessing the website, can I figure out which customers are most similar in terms of their journeys? Which sets of journeys indicate a broken versus a seamless experience? Were you able to find the particular information that you were looking for, or was it a broken experience? And finally, which are those four or five major routes that a customer takes before they convert? Can I get other customers onto a similar track in order to speed up the conversion process, to get more business, or to make their experience seamless? These are some of the key questions that businesses can really get more value out of if we are able to use embeddings, especially for representing the customer journey.
Now, we saw how we can represent a user using frequency counts or the time spent on different pages. But there are definitely some alternative approaches that can be used to represent a customer journey. First, let's take a step back: rather than representing a user itself, let's just stick to categories for now. We have seen that there are different page categories the customer goes through, like reviews, finance-related pages, and service-related pages. A conventional way to represent any discrete category is one-hot encoding, where the representation is generally a sparse vector with only one value filled in for that specific category. Then you have methods which are frequency based, like the traditional CountVectorizer or TF-IDF, which use frequency counts in order to represent your categories. And the third and final one is prediction based, which essentially points towards embeddings, and which we are going to delve into. But before going into embeddings, let's just see some of the challenges that we have with the conventional methods. Suppose we are using one-hot encoding. A major issue with one-hot encoding or CountVectorizer is that it doesn't solve the high-cardinality variable issue. What I mean by that is: the higher the number of categories, the larger the vector representation you end up with. Imagine if you have 50,000 or 1 million unique categories in your business; you might end up with a vector of that length to represent each category. So that's a drawback. Second is the semantic signal. These representations don't take into consideration whether there is any relationship between any of those categories. It does not care if category A is similar to category B; for representation purposes, it's all the same. And if you want to visualize how far or close the categories are to each other, it does not help in that sense.
And the final drawback is that it does not take supervision into account. It doesn't matter whether the user converted or not; to represent a customer journey or a category, it simply takes the frequency count. It doesn't take any sort of supervision into consideration. So those are the major drawbacks or challenges in using one-hot-encoding-style techniques.
To give an example, let's take three categories: specification, price, and features. If we want to represent these using one-hot encoding, we can do it in the manner shown on the screen: for specification, only one value would be enabled and the rest would be zero, resulting in a sparse matrix. One drawback, as I mentioned, is that the number of columns would be equal to the number of unique categories you have in your data. But the other drawback is that if you want to calculate how similar or dissimilar these categories are to each other, you cannot do that using one-hot encoding. If you calculate the similarity between any two of these, it would come out as zero, because each column has just one entry with the value one and the rest are filled with zeros. Hence, these are some of the drawbacks of using conventional representation methods.
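A quick sketch of the similarity problem just described: one-hot vectors for distinct categories are mutually orthogonal, so any dot-product-based similarity comes out to zero. The vectors and vocabulary order are hypothetical.

```python
import numpy as np

# One-hot vectors for three page categories (assumed vocabulary order:
# specification, price, features).
specification = np.array([1, 0, 0])
price         = np.array([0, 1, 0])
features      = np.array([0, 0, 1])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every pair of distinct one-hot vectors is orthogonal, so the similarity
# is always 0 -- the representation carries no notion of relatedness.
print(cosine_similarity(specification, price))     # 0.0
print(cosine_similarity(price, features))          # 0.0
```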
One thing that this kind of approach misses out on is that it totally ignores the sequence of events. And like I said at the start of the presentation, it's a journey. It's a sequence of events that results in a final decision on whether you're going to buy a product or not, and hence the sequence of events becomes super critical in order to represent a user journey. So the overall goal is: can we represent each of these page categories with a vector which captures the underlying semantics? We understand what a finance-related category is, what a price-related category is, and what a review-related category is, but for our model or machine, those are all the same. To give you an analogy: for English words, we understand words by their meanings, and we know that in different contexts a word can mean something different. Machines, on the other hand, don't get the underlying semantics, or how the context changes the meaning of the word. So the idea is to represent those words with numbers which capture their meaning, considering the context in which they appear.
And not only the page categories: can we go a step further and represent each user journey as well, such that it captures the entire sequence in a way that lets us compare these journeys from one user to another? Can we say that this user is very similar to another user in terms of the journey they have taken towards making their decision?
So, that's the overall goal, and that's where embeddings come into the picture.
So, I know embeddings have traditionally been used a lot in the NLP field, but I want to take it further and use them in the prediction part, especially in customer journey prediction. So, what are embeddings, essentially?
An embedding is a representation of a categorical variable. Although we can have embeddings for numerical variables as well, in this case we are focusing on discrete categorical variables. It is a series of continuous numbers chosen in such a way that it captures the underlying semantics. So if you want to represent a particular page category, say a price-related category, it should be placed nearby a finance-related category; we know those are more or less very similar to each other, so when we place them into a vector space, they should lie very close to each other. Whereas if you compare it with a homepage, which is not very related to your finance-related page category, it should be a little far from the finance-related category. It's basically a dimensionality reduction technique: we are representing a word with a set of numbers, and these numbers are nothing but a set of features which capture the most important aspects of that word. It's not very different from image embeddings, where we represent an image as a set of features which define that image. In a classic example, we represent king, man, and woman each with a vector, and once we subtract the man vector from the king vector and add it to the woman vector, the result should be very close to the queen vector. So the overall idea is for embeddings to learn the context behind those words, to understand the words or the categories in a similar way to the way we do.
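The king/queen analogy can be sketched with toy vectors. Note that these four vectors are hand-picked purely for illustration; real word embeddings are learned from data, not written by hand.

```python
import numpy as np

# Toy 4-d vectors chosen by hand so the analogy works out exactly;
# learned embeddings would only satisfy it approximately.
king  = np.array([0.9, 0.8, 0.1, 0.7])
man   = np.array([0.1, 0.9, 0.0, 0.1])
woman = np.array([0.1, 0.1, 0.9, 0.1])
queen = np.array([0.9, 0.0, 1.0, 0.7])

result = king - man + woman   # the "king - man + woman" analogy vector

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The analogy vector should land much closer to "queen" than to "man".
print(cosine(result, queen))
print(cosine(result, man))
```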
So, in this case, if we want to represent price as a category, we can get an embedding for it. And again, the embedding can be of whatever size you want. It's not mandatory to have very large embeddings; it depends case to case. Sometimes you might want an embedding of just size 50, and sometimes you might look for size 200. In this case, let's say we want to represent the price category with a fixed-length vector of size 100, and similarly for specification and features.
The distinction is very clear: in this case, we are able to calculate how similar or dissimilar these particular categories are to each other. Here, specification and price are not very close to each other, and price and features are again not very close to each other. In fact, features and specification are very close to each other in terms of where they lie in the overall vector space.
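In contrast to the one-hot case, dense embeddings give graded similarities. A small sketch with hypothetical 4-dimensional vectors (real ones might be 50 to 200 dimensions, as mentioned above):

```python
import numpy as np

# Hypothetical dense embeddings for three page categories.
emb = {
    "specification": np.array([0.8, 0.7, 0.1, 0.2]),
    "features":      np.array([0.7, 0.8, 0.2, 0.1]),
    "price":         np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Unlike one-hot vectors, dense embeddings place related categories close
# together: "features" and "specification" come out near each other,
# while "price" sits farther away in the vector space.
print(cosine(emb["specification"], emb["features"]))
print(cosine(emb["specification"], emb["price"]))
```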
So some of the immediate advantages: first, it provides a fixed-size representation. The size is not tied to the number of categories in your data; you can decide on the fixed-size vector you want to represent your categories with. Second, it groups similar categories together, based on how the embeddings have been learned. The embeddings are learned in such a way that similar words or categories end up with similar representations.
Alright, and embeddings are not restricted to words or discrete categories. They can also apply to images, text, or music, and in fact, today we are going to see how we can get user embeddings, or user journey embeddings. Once these are learned, they become very powerful and we can use them in different contexts and different areas.
So, now we are going to jump into how we can learn these embeddings. We know it's a set of continuous numbers, and they represent something meaningful pertaining to that particular category. There are three ways in which you can get these embeddings. One is without labels, where you don't actually use a target variable in order to learn these embeddings. The second is with labels, which means we make use of a target variable in order to learn these embeddings. And the third option is to directly use pre-trained embeddings that have been trained with bigger models on a bigger corpus of data; they are much more efficient and faster to use. So, let's see the without-label approach first.
I'm going to take us into a little bit of NLP now, because it's easier to explain embeddings when we are learning through NLP. "You shall know a word by the company it keeps," as John R. Firth said. One thing is for sure: words appear together in a particular context. 'Laugh' and 'joke' are likely to appear together in most contexts, and the first name and last name of a person will again appear together in most contexts. So there are two key things that we need to remember in terms of learning embeddings.
One is the context word, and the other is the target word. Take the example: "the earth is round and moves around the sun." This is a simple sentence, right? We are not looking to learn these embeddings using a target variable; if you were doing sentiment analysis and trying to find out whether a sentence indicates a positive or negative review, that is not the goal here. When we don't use labels, all we need is a sequence of words or categories. Using these, we learn the embeddings, and how we learn them depends on the context in which the words appear. Let me elaborate on this example quickly.
Like I said, there are two key things: one is the context words, the other is the target words. The context words are those which appear around the target word, so you can call them neighboring words as well. In this case, if 'earth' is our target word, 'the' and 'is' become the context words: 'earth' is appearing in the context of 'the' and 'is'. Similarly, if 'earth' is the context word, the targets can become 'the' and 'is'. So we can use it both ways. How do we learn the embeddings? We try to find out: given a word, can we predict the neighboring words in its context? And given the neighboring words, can we predict the target word?
And the way to do this is to use a very simple neural network, and essentially it's called Word2Vec. There are multiple models to learn embeddings, but we are going to see Word2Vec today. Since some of you have already worked with Word2Vec, I'll just keep it high level. There are two ways in which you can learn these embeddings in Word2Vec: one is Continuous Bag of Words (CBOW), and the other is Skip-Gram. What happens in the Continuous Bag of Words model is that you pass in the context words. In this case, we pass 'the', 'earth', and 'round', and 'is' becomes our target word. There's a hyperparameter, the window size, and you can decide how large a window you want to keep in order to learn these embeddings. But the overall idea is to pass these words as input to your neural network. This neural network contains a single hidden layer, and this hidden layer learns the weights needed to predict the target word. Conventional backpropagation happens to learn these weights, and the weights that get learned in this hidden layer become your embeddings. So basically: given all these context words, what is my target word? And as the training happens, we get an embedding for each of the words. For the Skip-Gram model, most things remain the same, except that now we are using the target word to predict the context words.
So in this case, we just provide one input and we try to predict the surrounding context words. And in the process, we learn the embeddings, which are nothing but our hidden layer.
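Before any network is trained, Word2Vec-style learning reduces a sentence to (target, context) training pairs within a window. A minimal sketch of that pair generation for the Skip-Gram case, where the window size is the hyperparameter mentioned above:

```python
# Sketch of how Skip-Gram training pairs are generated from a sentence,
# before any neural network is involved. The window size controls how
# many neighbors on each side count as context.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))  # (target, context) pair
    return pairs

sentence = "the earth is round".split()
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

With window=1, 'earth' pairs with its immediate neighbors 'the' and 'is', exactly as in the example above; a larger window would also pull in 'round'.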
So, coming back to our example of page categories: for each user, we have the specific sequence in which they have gone through the journey. They might have come to the homepage, then offers, then finance. We can consider this similar to a sentence, with each page category as a different word. So what we do is pass these individual stages of the journey as input, and we try to predict the nearby target word. And in the process, we learn the embeddings for each of these page categories.
Alright, so this was the approach when we are not using any sort of supervision or target variables; we are just using the sequence information that we have and trying to learn embeddings. Now let's see how we can use the target variable in order to learn these embeddings really well.
So in our case, we pass the sequence of steps that the customer has taken as the input to the model, and we build a deep neural network. The first layer becomes our hidden layer, or the embedding layer, and the number of neurons in this layer determines your embedding size: that is the fixed-size vector you get as an output from this layer. The rest of the network you can design as you want; you can have any number of layers and any number of neurons in your architecture, but the overall goal is to optimize your loss in order to learn these embeddings really well. For a given input, you know whether that customer converted or not, so you have your observed target and your predicted target, and you want to reduce this loss through backpropagation and thereby learn these embedding weights. Once your model has learned the embeddings, you can then use them to represent the customer journey, to make predictions, to compare different customer journeys, and to represent the individual page categories as well.
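A minimal numpy sketch of the forward pass of such a model: look up the embedding for each step of the journey, pool them, and produce a conversion probability. In practice you would build this with an Embedding layer in a framework like Keras or PyTorch and let backpropagation update the embedding matrix; here the weights are just random numbers for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 6 page categories, 4-d embeddings, journeys of length 5.
n_categories, emb_dim = 6, 4
embedding_matrix = rng.normal(size=(n_categories, emb_dim))  # the layer being learned
w, b = rng.normal(size=emb_dim), 0.0                         # output (logistic) layer

def predict_conversion(journey_ids):
    """Forward pass: look up embeddings, mean-pool the journey, apply sigmoid."""
    vectors = embedding_matrix[journey_ids]   # (journey_len, emb_dim) lookup
    pooled = vectors.mean(axis=0)             # single journey representation
    logit = pooled @ w + b
    return 1.0 / (1.0 + np.exp(-logit))       # probability of conversion

# e.g. homepage -> finance -> finance -> test-drive -> price (made-up ids)
journey = [0, 2, 2, 5, 1]
p = predict_conversion(journey)
print(round(p, 3))
```

During training, the loss between this prediction and the observed conversion label is backpropagated all the way into `embedding_matrix`, which is how the supervision shapes the embeddings.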
Alright, so now that we have seen different ways to calculate embeddings, we can move on to the next step, which is taking these embeddings and tailoring them to our specific needs. We call these custom embeddings.
Using either of the two approaches, Word2Vec or with-label learning, we have arrived at embeddings for each of the page categories we have. Let's say we have 100-dimensional embedding vectors for each of these page categories. Now, if we visualize these embeddings, we can clearly see that they are able to capture the underlying semantics for each of these page categories.
Categories related to test drives are close to each other, and categories related to service, warranty, and reviews are close to each other, whereas categories like home page, offers, and contact us are relatively far from the rest. So the embeddings are able to learn the meaning of the individual categories: not just frequency counts, but numbers actually capturing the information that each particular category carries. That's how you can visualize these embeddings, and it gives you more confidence to move towards the prediction part, because now, instead of simply using a frequency count, you actually have a meaningful representation for each of these page categories.
To take it one step further, we use these embeddings to convert the customer journey into a representation as well. In this case you can see our visitor one goes through different pages, and for each of these pages, we have already learned the embeddings. So what we do is take a weighted mean of all of these individual embeddings and come up with a vector representing this user. For this visitor, we combine the embeddings of the individual pages and represent the journey with a single vector of size 100, which is nothing but a weighted mean. Similarly, for each user, you take the embeddings for their page categories, take the mean, normalize it, and get the final embedding for each user journey. But to make it more interesting, we add a little bit of extra information to these embeddings. Since we already have the time information, basically the time spent on each page of the website, we can add that information as well. So for a given user journey, we already know the time spent on each page category, and we take this time spent and combine it, via a dot product, with the embeddings to enrich the customer journey vector.
So, for each of the page categories, we multiply its embedding by the time spent, making the representation more enriched. As a result, we get a final embedding for the user that represents the entire journey. It takes into account the order of the steps that the customer has taken, it takes into account the amount of time the customer has spent, and it also considers the meaning of the page category itself. For example, a customer spending more time on brochures or specifications might not be very serious about buying the vehicle, whereas someone spending more time on reviews and test drives is more likely to buy. These embeddings are able to capture that information. So at the end, what you get is a user journey embedding, represented by a fixed-size vector.
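The time-weighted combination described above can be sketched as follows, with hypothetical 3-dimensional page embeddings and made-up dwell times:

```python
import numpy as np

# Hypothetical page-category embeddings (3-d for readability; a real
# setup might use 100 dimensions as in the talk).
emb = {
    "homepage": np.array([0.1, 0.9, 0.1]),
    "finance":  np.array([0.8, 0.2, 0.1]),
    "reviews":  np.array([0.2, 0.1, 0.9]),
}

# One visitor's journey with dwell time (seconds) per page.
journey = [("homepage", 20), ("finance", 180), ("reviews", 60)]

def journey_embedding(journey, emb):
    """Time-weighted mean of the page embeddings: pages the visitor
    dwelt on longer contribute more to the final journey vector."""
    weights = np.array([t for _, t in journey], dtype=float)
    weights /= weights.sum()                        # normalize dwell times
    vectors = np.stack([emb[page] for page, _ in journey])
    return weights @ vectors                        # weighted mean

vec = journey_embedding(journey, emb)
print(vec.round(3))
```

Because this visitor spent most of their time on finance pages, the journey vector is pulled towards the finance embedding, which is exactly the enrichment described above.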
But embeddings are not really useful by themselves unless you visualize them. In this case (this is built on dummy data, for representation purposes), on the left-side plot we can clearly see the customers who are not very serious about buying the vehicle, and we can see that the journeys of people who are serious about buying the car tend to be similar. These are the embedding representations, so each dot represents a user. On the right-side plot, we have broken it down further into casual visitors, serious visitors, and those who actually buy the vehicle. You can see which customers were almost on the verge of making the decision to buy but did not: their journeys are very similar to those of the people who converted. So the business can think about what can be done to nudge them back towards that decision to buy the vehicle, because they have already covered the steps that are very similar to those of the people who converted. It's a very powerful way to visualize your customer journeys through embeddings.
Alright, so this is more of a static view, but in case you want a dynamic view of embeddings, there's something that Google has provided called the TensorFlow Embedding Projector. Let me share my screen and show you how we can use it. Essentially, you can load your own set of embeddings along with their metadata, and view them in 3D space. It shows you how embeddings are able to capture the underlying semantics, so that similar embeddings are placed nearby. If we search for 'good', you'll see all the related words nearby. If we search for 'exciting', you can see 'excited' and 'excitement' close to it in the embedding space. And if you search for 'car', many words come up, but they are placed well apart from each other because their embeddings represent different meanings, whereas, let's say, 'carpet' and 'curtain' lie nearby each other.
So similarly, each word learns a different embedding, and depending on the context in which it's been used, we can visualize it in the embedding space. You can see how this makes embeddings much more powerful to use in certain contexts. Some of the other advantages you get from embeddings: if you want to find nearest neighbors, be it customer journeys or items you want to recommend that are very similar to each other, you can use embeddings. You can also use these embeddings as input features to your machine learning models; they take into consideration the order of sequences, which makes them even more powerful. And finally, if you want to understand the relationships between different categories, instead of just having frequency counts of those categories, you can actually use their underlying meaning.
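For reference, the Embedding Projector accepts a tab-separated file of vectors plus a matching tab-separated metadata file with one label per row. A small sketch of exporting embeddings in that format (the embedding values here are made up):

```python
import csv
import os
import tempfile

# The Embedding Projector (projector.tensorflow.org) loads two TSV files:
# one with the vectors, one with the matching labels, row for row.
embeddings = {
    "finance": [0.8, 0.2, 0.1],
    "price":   [0.7, 0.3, 0.2],
    "reviews": [0.1, 0.2, 0.9],
}

out_dir = tempfile.mkdtemp()
vec_path  = os.path.join(out_dir, "vectors.tsv")
meta_path = os.path.join(out_dir, "metadata.tsv")

with open(vec_path, "w", newline="") as vf, open(meta_path, "w", newline="") as mf:
    vec_writer  = csv.writer(vf, delimiter="\t")
    meta_writer = csv.writer(mf, delimiter="\t")
    for label, vector in embeddings.items():
        vec_writer.writerow(vector)     # one row of numbers per embedding
        meta_writer.writerow([label])   # matching label on the same row

print(vec_path, meta_path)
```

Uploading these two files through the Projector's "Load" button gives the same kind of interactive 3D view shown in the demo, but for your own page-category or journey embeddings.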
Alright, so this is all that I had for this session. I've also added some additional resources that you can go through, which you should find useful. Thank you very much. I appreciate you taking the time for this session.
Bain & Company
Pramod Singh is a Team Lead at Bain & Company. He has over 10 years of hands-on experience in machine learning, deep learning, AI, data engineering, designing algorithms, and application development. He's the author of three books: Machine Learning with PySpark, Learn PySpark, and Learn TensorFlow 2.0. He's also a regular speaker at major conferences such as the O'Reilly Strata Data and AI Conferences. Pramod holds a Master's degree from Symbiosis and a Business Analytics certification from IIM-Calcutta. He is also a visiting faculty member teaching and mentoring on ML & AI at different educational institutes.