Frequently Bought Together Recommendations Based on Embeddings


We are the recommendation team that performs data engineering, machine learning and software engineering practices at Hepsiburada, the largest e-commerce platform in Turkey and the Middle East. Our aim is to generate relevant recommendations for our users in the most appropriate manner in terms of time, context and products.

One of the many recommendations we serve to our clients is the “frequently bought together” products. Generation of “frequently bought together” recommendations of millions of products to millions of customers is a challenging process which requires specific approaches. There are many steps the recommendation development team must take to achieve this goal.

In this talk, we explain the problems we have overcome, from training a model to productionizing it, following the model's metrics in production and keeping the model updated.

Our tips and tricks to be shared with the community are as follows:
1. Embedding Based Recommendation
– Context and arithmetic operation problems
– Pros and cons
– In which cases you may need dimension reduction
2. Offline Metrics
– Hyperparameter tuning and pre-production checks with MLflow
3. Pipeline
– ETL (PySpark + Oozie) and serving layer based on a continuous delivery mindset
4. Experimental UI
– Why you need a manual control mechanism for such a product
5. Embedding Serving Layer
– KNN search
– hnswlib vs. others: pros/cons
– Programming language and environment selection for the serving layer
– Post-processing needs: metadata, filtering and sorting options
6. Online Metrics
– Why online metrics are better than offline ones
7. Do We Need a More Complex Model or Better Tricks?
– Time and position tricks can be better than a much more complex model

Speakers: Mehmet Selman Sezgin and Ulukbek Attokurov


– Hello everyone, welcome to our presentation. Today we would like to tell you about a production use case of embedding-based recommendations; specifically, we'll be talking about frequently bought together recommendations. My name is Ulukbek, and I work as a data scientist at Hepsiburada. My colleague Selman works as a senior data engineer at Hepsiburada. In the first part of our presentation, I will describe the modeling stage of our project. In the second part, my colleague Selman will give you a general architecture overview. Let me shortly introduce our company. Hepsiburada is one of the largest e-commerce companies in Turkey and the Middle East. It has more than 30 million products in more than 40 categories, ranging from retail food to electronics, and it has high traffic: more than 200 million visitors per month. One of the main topics of our presentation is embeddings. An item is represented in a low-dimensional space using embeddings. For example, word embeddings capture the correlation between words in a sentence by projecting them into a low-dimensional space. Why did we decide to employ embeddings to generate recommendations? Our previous system used co-occurrence-based statistics, specifically a TF-IDF-like metric (Salton '89). It uses behavioral data such as product views, add-to-cart events and orders, and it is an item-based recommendation, so users and items cannot be represented in the same space. There are different types of methods to generate embeddings, such as ResNet, Word2Vec, Doc2Vec, BERT, etc. These types of models might be useful in content-based recommendations, where image or text data is used. In addition, embeddings generated from image, text and behavior data can be concatenated to represent different dimensions of the same item.
The main drawback of co-occurrence recommendations is that recommendations are not generated if two items do not occur in the same sequence. Also, context information such as neighboring items is not employed. On the other hand, embeddings can be used as features in unsupervised and supervised methods. For example, embeddings can be used in an LSTM to model the sequential behavior of customers. In our case, the resulting embeddings are used in KNN to find the most relevant items. The main goal of frequently bought together recommendation is to recommend multi-alternative but complementary products. One of the main challenges is the diversity of sequences, where products from semantically different categories are included in the same sequence. Another issue is the scale of recommendations: we are generating recommendations for more than 30 million products distributed over more than 40 categories. We should remember that bought-together recommendation does not mean that items which appear together are similar. Simple co-occurrence counts capture first-order co-occurrences, while embeddings capture second-order co-occurrences, or similarities. In our case, we have employed the Word2Vec model. Word2Vec is a simple model: it's easy to train, and you don't need any special labels in the training samples. Furthermore, it has many implementations, and you can retrain it for specific business needs. Let's talk about data preparation. User behavior such as product views and orders in a specific time range is treated as sentences. In this case, the set of purchased items corresponds to a bag-of-words model. Let's assume an order included a keyboard, a computer and a mouse; the corresponding list of items constitutes one observation in the training set. As discussed in the previous slide, sequences contain products from diverse categories, so in order to decrease noise, the sequences are split into subsequences.
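The context-aware splitting can be sketched in a few lines of Python. The 20-item cap and the category/gender context follow the talk; the item layout, field names and function name here are illustrative:

```python
from itertools import groupby

MAX_SUBSEQ_LEN = 20  # cap subsequence length to keep recommendations relevant

def split_by_context(order, max_len=MAX_SUBSEQ_LEN):
    """Split one order (a list of items) into context-consistent subsequences.

    Items sharing the same (category, gender) context end up in the same
    subsequence; long groups are chunked so no subsequence exceeds max_len.
    """
    key = lambda item: (item["category"], item["gender"])
    subsequences = []
    for _, group in groupby(sorted(order, key=key), key=key):
        items = [item["product_id"] for item in group]
        for i in range(0, len(items), max_len):
            subsequences.append(items[i:i + max_len])
    return subsequences

order = [
    {"product_id": "keyboard", "category": "electronics", "gender": "unisex"},
    {"product_id": "mouse",    "category": "electronics", "gender": "unisex"},
    {"product_id": "shoes",    "category": "footwear",    "gender": "male"},
    {"product_id": "socks",    "category": "footwear",    "gender": "male"},
]
print(split_by_context(order))  # → [['keyboard', 'mouse'], ['shoes', 'socks']]
```

Each resulting subsequence then becomes one "sentence" in the Word2Vec training set.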
In the given example, the keyboard and mouse are included in the first subsequence, whereas the shoes and socks are contained in the second subsequence: the keyboard and mouse belong to the electronics category, while the shoes and socks are items related to footwear. In the given code snippet, context separation using gender and product category information is shown. Sequences are grouped into subsequences, where products with the same category label and gender information are placed in the same subsequence. Also, the length of subsequences is limited to a maximum of 20 items to support the relevancy of recommendations. Let's talk about Word2Vec parameters. Several considerations should be taken into account while training your model. Coverage decreases when min_count is set to higher values, so the min_count and vector size parameters are set as low as possible to increase coverage and to decrease the storage and computational costs, respectively. The window parameter is set to the maximum sequence length, since the order of items within an order is random. We use KNN to find the most similar items. Relevancy depends on the selected distance metric. For instance, the Euclidean distance metric measures the distance between two points, while cosine similarity measures the angle between two vectors; that's why Euclidean distance is affected by vector lengths. We have used standard evaluation metrics such as precision at K, recall at K and hit rate at K to evaluate our models and to trace the performance of the model in production. It might be difficult to measure the accuracy of embeddings; however, standard metrics at least reflect a general picture of the performance of the model. We use MLflow to track the parameters and evaluation metrics. It has a user-friendly interface to investigate metrics visually and graphically, as shown on the right. Also, it makes team collaboration easier since experiments are stored on a central server.
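The point about distance metrics is easy to verify with NumPy: for two vectors pointing in the same direction, Euclidean distance grows with vector length while cosine similarity stays at 1.

```python
import numpy as np

a = np.array([1.0, 2.0, 0.5])
b = 3.0 * a  # same direction, three times the length

def euclidean(u, v):
    """Distance between two points; sensitive to vector magnitude."""
    return float(np.linalg.norm(u - v))

def cosine_similarity(u, v):
    """Angle-based similarity; insensitive to vector magnitude."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(euclidean(a, b))          # nonzero, and grows if b gets longer
print(cosine_similarity(a, b))  # 1.0: the angle between a and b is zero
```

This is why cosine distance is usually the natural choice for KNN over embeddings whose norms carry no meaning.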
In the screenshot of MLflow, hyperparameter tuning of the model is shown; the parameter values corresponding to the highest precision are connected with red lines. As shown on the graph, precision reaches its highest value when min_count equals five and the negative sampling and frequent-word sampling parameters are set to their minimum values. Here an example of integrating MLflow in the source code is shown: on line 24 the MLflow run is started, the parameters are logged on line 30, and the evaluation metrics are logged inside the training function. You can apply arithmetic operations on embeddings. For instance, a brand embedding can be an average of its product embeddings, and these brand embeddings might be used to calculate brand similarities. A few examples of arithmetic operations are given on the slide, but you should remember that accuracy decreases when categories are represented by the embeddings of low-level entities such as product embeddings. Brand similarity is relevant if a brand contains homogeneous products. For instance, if a brand is a shoe brand, then its representation using the average of product embeddings might be more accurate compared to brands which contain products from diverse categories, as shown on the right. Thank you for your attention. We have discussed the modeling part of our project. Now my colleague Selman will describe the general architecture overview of our project. Please Selman, take your time.
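The brand-averaging idea can be sketched with NumPy. The two-dimensional vectors and product names below are invented for illustration; in practice the vectors come from the trained Word2Vec model:

```python
import numpy as np

# Hypothetical product embeddings standing in for trained Word2Vec vectors.
products = {
    "sneaker_a": np.array([0.9, 0.1]),
    "sneaker_b": np.array([0.8, 0.2]),
    "blender":   np.array([0.1, 0.9]),
}

def brand_embedding(product_ids):
    """Represent a brand as the average of its product embeddings."""
    return np.mean([products[p] for p in product_ids], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

shoe_brand  = brand_embedding(["sneaker_a", "sneaker_b"])  # homogeneous brand
mixed_brand = brand_embedding(["sneaker_a", "blender"])    # diverse brand

# The homogeneous brand stays close to its products; the diverse one drifts.
print(cosine(shoe_brand, products["sneaker_a"]))
print(cosine(mixed_brand, products["sneaker_a"]))
```

The drop in similarity for the mixed brand is exactly the accuracy loss the talk warns about when diverse catalogs are averaged.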

– Thank you Ulukbek. Now I will give you information about the general view of the architecture and the production phases of the model. We needed a system that makes it possible to train a new model, evaluate it with offline metrics and experience it with a user interface, so we started designing the system shown in the slide. The system has two major phases: an experimental phase and a production phase. After making all the measurements, if the modeling and indexing results are successful, we can put the model into the production system and monitor it with online metrics easily. The system that we designed and implemented should give continuous delivery ability to the recommendation team; continuous development and deployment ability is very crucial when you create a system that needs incremental improvements. In the slide you can see the experimental phase on the left, which starts with data preparation in a context-aware manner. After that come model training, measurement of the offline metrics and metadata generation. Metadata includes category, gender, price and brand; this metadata information is used by post-filtering actions. After creating an embedding model and creating the product embeddings with this model, it comes to KNN indexing. The indexing phase is where hnswlib comes into the scene: we create a KNN index with the embedding data and the metadata, and a binary file is generated from this indexing process. The KNN indexer pushes the binary file to a file storage with a test label if the offline metrics of model training and index creation are satisfying. After index creation, we are ready to start an API with the index binary that was created by the KNN indexer. The API application can query recommendations very fast on the generated index. When the API starts, it gets the final binary file with the matching label from the binary file storage.
We can then try and see the recommendation results with the experimental UI at this phase and decide the post-filtering parameters. If the experimental check is also okay, then we just change the direction of deployment to production and make the model results available to customers. After the model is in production, we always track the online metrics via Superset, which uses Apache Hive as the data source. Daily and weekly online metric controls are very important; we may create enhancement tasks using these metrics. These enhancements could be done in the model training, index creation and post-filtering parts of the system. While creating the system and the data pipeline, we needed some tricks. I'm going to mention them briefly and show some code samples. After model training, we need to create embeddings for all products, which means processing millions of them. Creation of embeddings on a single machine takes many hours to finish. To overcome this time-consumption issue, we use PySpark pandas UDFs. Thanks to pandas UDFs, we could parallelize the embedding process using our existing Hadoop cluster's power; it takes less than 1/50 of the single machine's time. If you need to use the toPandas function on a Spark DataFrame, be careful: this conversion consumes more memory than expected. You can see the code sample for the pandas UDF scenario that I mentioned. Embedding creation is done in the UDF with a distributed approach at line 28. The code creates a random group ID for all records to distribute them over this group ID at line 42. At line 45, the create-embedding function is applied to the DataFrame using the GRP column as the grouping column. So if your group count is 50 and you have 500,000 rows for which to create embeddings, then every node of the cluster processes only approximately 10,000 records, and every node does this job in parallel.
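A minimal sketch of the grouped pandas UDF pattern: the function body runs once per random group, and the Spark wiring (shown as comments, with illustrative column names and schema) is what distributes those groups across the cluster. The embedding lookup dict stands in for the trained model.

```python
import pandas as pd

# Hypothetical lookup standing in for the trained Word2Vec model's vectors.
MODEL_VECTORS = {"keyboard": [0.1, 0.2], "mouse": [0.3, 0.4]}

def create_embeddings(pdf: pd.DataFrame) -> pd.DataFrame:
    """Grouped-map function: map each product id in one group to its vector."""
    pdf = pdf.copy()
    pdf["embedding"] = pdf["product_id"].map(MODEL_VECTORS.get)
    return pdf[["product_id", "embedding"]]

# In PySpark the same function is applied per random group, so the work is
# spread over the executors (a sketch, not the talk's exact code):
#
#   df = df.withColumn("grp", (F.rand() * 50).cast("int"))
#   embeddings = df.groupBy("grp").applyInPandas(
#       create_embeddings, schema="product_id string, embedding array<double>")
#
# Locally, the grouped function can be exercised with plain pandas:
local = pd.DataFrame({"product_id": ["keyboard", "mouse"], "grp": [0, 1]})
result = pd.concat(create_embeddings(g) for _, g in local.groupby("grp"))
print(result)
```

Keeping the per-group function testable in plain pandas like this makes the distributed job much easier to debug before it runs on the cluster.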
After the executors' jobs are done, all the embedding data is unioned in the driver node, and then you can do other processing on it; in our case, the next process is writing it to HDFS. For measuring the model quality, there are three important examination components. Offline metrics are measured at the training time of the model and the indexing time of the embeddings. The experimental UI provides examination with the human eye and makes it possible to decide on the filtering parameters. Online metrics are measured after the production phase to see the real effect of the work done. We use a nearest neighbor algorithm to create the KNN index; thanks to this KNN index, we can query over it and generate recommendations. Before deciding on the library for the nearest neighbor algorithm, we checked ANN-Benchmarks. There is a graph on the right where you can see a performance overview of the libraries; for details you can visit the ANN-Benchmarks website. After that, we decided on a programming language and chose Golang for both the indexing and serving layers because of Golang's small deployable executable binary and lightweight goroutine mechanism. Other than these, we checked similarity function possibilities different from Euclidean distance, and whether we needed a distributed index or not. If your case makes it hard to create the index from scratch, then use a library that makes incremental item insertion and deletion possible; in our case, it's not a necessity. We decided to go with the HNSW library, Golang and a custom-developed post-filtering layer. There are trade-offs for the decisions made over the parameters while building the nearest neighbor tree with the HNSW library. If you decide to keep the tree simple and lower the neighbor count, it results in less resource consumption but a poor recall value. If you increase the tree's complexity, you will get higher recall values, but be careful about total resource consumption: it becomes a waste of time and resources after some point.
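Recall of an approximate index like HNSW is measured against exact brute-force search. A minimal exact cosine-KNN baseline in NumPy (the talk's production version is an hnswlib index queried from Go; this toy data and function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 32)).astype(np.float32)  # toy product vectors

def knn_query(query, k=5):
    """Exact cosine KNN: the ground truth an HNSW index's recall is scored against."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    sims = embeddings @ query / norms
    top = np.argsort(-sims)[:k]  # indices of the k most similar items
    return top, sims[top]

# The approximate equivalent with hnswlib would look roughly like:
#   index = hnswlib.Index(space="cosine", dim=32)
#   index.init_index(max_elements=1000, ef_construction=200, M=16)
#   index.add_items(embeddings)
#   labels, distances = index.knn_query(query, k=5)

ids, scores = knn_query(embeddings[42], k=5)
print(ids[0])  # → 42: the item itself is its own nearest neighbor
```

Comparing `knn_query` results against the index's answers over a sample of queries gives the recall number behind the speed/recall trade-off described above.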
You need to choose optimal values for the indexing parameters. The fact that the embedding vectors of products are close to each other in the vector space doesn't require these products to be shown with each other as recommendations in all cases. In many cases, post-filtering of the recommended product list is required. In this way, recommendations that may seem unrelated from the point of view of brand, price, gender or category can be eliminated. Also, business requirements such as "these two brands cannot be shown with each other" or "do not show adult products with health products" necessitate post-filtering. Creating the KNN index with the metadata included enables us to apply any kind of post-filtering. Here we can see three examples of simple filtering functions that we implemented in our serving layer. If your recommendation context is clothing, then you may want to recommend products with the same gender meta info; or if you are recommending one electronic device for another, then recommending a $10,000 device for a $100 headphone may not be a good idea. Using this metadata while training the model is another way to handle this situation, but it causes a more complex training process and also doesn't guarantee the results the way the post-filtering process does. While deciding the parameters of post-filtering, you need an evaluation medium; we implemented an experimental UI for this purpose. With the try-fail method, we can decide on parameters like category, price, brand, gender, etc. to gather the best results from the model, and we can do it by seeing. After the decisions are made by the team, we can use the desired parameters in the implementation of the post-processing layer. Performance metrics of the serving layer are very important for us because we are serving millions of customers as the recommendation team of Hepsiburada, so our serving layer has to handle a large scale of traffic with low latency.
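Two of the filtering rules mentioned (gender match, price-ratio cap) can be sketched as follows; the production version runs in Go inside the serving layer, and the field names, threshold and items here are illustrative:

```python
def post_filter(anchor, candidates, max_price_ratio=10.0):
    """Filter KNN candidates using metadata stored alongside the index."""
    kept = []
    for item in candidates:
        if anchor["gender"] != item["gender"]:
            continue  # e.g. clothing: only recommend the anchor's gender
        ratio = max(anchor["price"], item["price"]) / min(anchor["price"], item["price"])
        if ratio > max_price_ratio:
            continue  # don't pair a $10,000 device with a $100 headphone
        kept.append(item)
    return kept

anchor = {"product_id": "headphone", "gender": "unisex", "price": 100.0}
candidates = [
    {"product_id": "headphone_case", "gender": "unisex", "price": 20.0},
    {"product_id": "hifi_amplifier", "gender": "unisex", "price": 10_000.0},
]
print([c["product_id"] for c in post_filter(anchor, candidates)])
# → ['headphone_case']
```

Because the rules read only metadata, thresholds like `max_price_ratio` can be tuned from the experimental UI without retraining or reindexing anything.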
To be sure about the performance quality of our serving layer, we applied a load test to our API application. For a single instance of the API application, you can see the screenshot taken from a Grafana dashboard: a single instance of the application responds to up to 8K requests per second with a one-millisecond response time. Here are two examples of frequently bought together recommendations from our production system. On the left, for a laptop computer, there are computer accessory products recommended. On the right, for a camping tent, other complementary camping equipment is recommended. The best way to understand the real success or failure of a recommendation model is running it in production, gathering clickstream and order data, and analyzing them under the well-known headings, which are conversion-related metrics, coverage-related metrics and revenue-related metrics. These metrics can be grouped in terms of different dimensions such as placement title, channel and gender. Using only CR and CTR is not enough; try to see the big picture with other metrics. Be careful about popular products and category contexts: a good CR value for one category may not be good enough for another one. Separation of contexts using the right dimensions gives you better inference. As final words, we have some suggestions for you. Using embedding representations in the recommendation domain is a good and working method. Word2Vec training is a simple job, but tuning the parameters wisely is important. Using offline performance metrics is also crucial to get the best from the model. Application of arithmetic operations may not work as expected; it's more complicated than the king − man + woman = queen equation, so be careful when using it. Using an agile methodology while developing such a system lets you see guiding and successful results as early as possible. Giving up on a failing experiment and putting the effort into the most promising experiment buys you time.
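The "good CR in one category may not be good in another" point is easiest to see by breaking the metrics out per dimension. A toy sketch (the counts are invented; in the talk's setup the real figures come from Hive via Superset):

```python
# Toy clickstream aggregates per category: impressions, clicks, orders.
stats = {
    "electronics": {"impressions": 10_000, "clicks": 400, "orders": 40},
    "footwear":    {"impressions":  5_000, "clicks": 300, "orders": 15},
}

def ctr(s):
    """Click-through rate: clicks per impression."""
    return s["clicks"] / s["impressions"]

def cr(s):
    """Conversion rate, here measured as orders per click."""
    return s["orders"] / s["clicks"]

for category, s in stats.items():
    print(f"{category}: CTR={ctr(s):.2%} CR={cr(s):.2%}")
```

Here footwear wins on CTR while electronics wins on CR, so ranking categories by a single metric would tell two different stories; the same grouping applies to placement, channel or gender dimensions.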
Scale is the number one question while designing these serving layers. An experimental UI is a good way of evaluating a model and a good medium for deciding the post-filtering parameters. And finally, this is a never-ending process: keep tracking the online metrics. Thank you all for listening to us. Please ask any questions you want; we will be in the Q&A chat room. We wish everyone a good summit.

About Mehmet Selman Sezgin


Multi-talented senior software & data engineer, successfully completing simultaneous projects. Willing to jump in to develop "outside the box" solutions. Talented project leader and complex problem solver with a results-focused and driven approach. Data enthusiast.

About Ulukbek Attokurov


My name is Ulukbek. I work as a data scientist in the recommendation team of HepsiBurada, which is the largest e-commerce platform in Turkey. Mostly, I am responsible for analytics and modeling tasks in our recommendation team. I have been working in the recommendation team of HepsiBurada for one year. Before that, I worked at companies such as Vodafone, DenizBank and Insider.

I pursued my Master's degree at Istanbul Technical University; my thesis was about the multi-document summarization task.

My main research area is applying NLP algorithms to recommendation systems.