Scaling AI At H&M

May 26, 2021 11:30 AM (PT)


This session is a continuation of “Apply MLOps at Scale” at Data+AI Summit Europe 2020 and “Automated Production Ready ML at Scale” at Spark + AI Summit Europe 2019. In this session you will learn how H&M is continuing to evolve and develop their AI platform in order to democratize and accelerate AI usage across the full H&M group, including speed to production, data abstraction, feature store, pipeline orchestration, etc.

 

Our existing reference architecture has been adopted by multiple product teams managing hundreds of models across the entire H&M value chain. It enables data scientists to develop models in a highly interactive environment and enables engineers to manage large-scale model training and model serving pipelines with full traceability. The current evolution aims to reduce both the time to introduce new features to the market and the learning feedback loop, by democratizing AI in the organisation and through a persistent focus on sound MLOps principles.

In this session watch:
Bjorn Hertzberg, Lead Data Scientist, H&M

 

Transcript

Bjorn Hertzberg: Hi, my name is Bjorn Hertzberg and I’m the head of data science at H&M. I’m here to talk about the work that we’re doing to scale AI across the full H&M group today. This presentation will be at quite a high level, talking about the group wide approach that we’re using to scale AI across all of H&M, including all brands, and it’s an initiative that we call the Fountainhead. At the end, I’ll also spend some minutes talking about a specific AI platform that we’re developing called the Content Personalization platform, going into some details and some code. But we’re starting off with some facts about H&M, then the Fountainhead vision, and then the Content Personalization platform.
For the people that don’t know H&M: H&M is a leading fashion brand driven by the desire to make great design available to everyone. We aim to inspire and enable people to express their unique style. And even though many people will recognize the H&M brand specifically, they may not know that we’re a group, a family of brands, most of which operate in at least 30 countries. Each of these brands is catering to a unique and well-defined customer group. The breadth and depth of customer insights that we get from this family is one of our key advantages, as I see it.
Each brand has specific needs and requirements in how to use AI to cater to their customer base, and they’re only loosely coupled to H&M, meaning that they can pick and choose what kind of initiatives they want to drive in their brand. Sometimes that creates a lot of interesting combinations that really help drive the development for us, but it also creates a great learning base for understanding how different brands and different companies use these AI initiatives. Each brand is also big enough to actually be a reasonable customer for any third-party provider out there, so the needs are well defined, well structured and sometimes aligned, but sometimes not.
The pooling of this data and these requirements creates a really good basis for a robust learning journey, both for our teams and for our machines. And exactly how we’re planning to cater to these diverse needs, always keeping focus on learning and on how our customers want to be treated, that is the topic for today.
H&M was founded in 1947, and today we employ more than 153,000 people worldwide. It’s one of Sweden’s modern success stories, with a physical presence today in 74 markets and 4,950 stores at the moment. H&M’s growth strategy has, for a long time, been to open new stores, simply because that was what was driving growth. And if we realized that we had bought too many clothes, we just needed to put in extra effort to open a new store and we would be able to sell the garments.
However, all of this changed in 2015. Digitalization had had an impact on H&M sales for a while, but in 2015 it became apparent that our business model was not performing as expected. The H&M stock peaked in March 2015, and in the three years that followed, the stock price fell by two-thirds. The value proposition, making great design available to everyone, was still intact, but the growth proposition of opening more stores was more or less meaningless when retail consumption was moving to online channels and online devices.
In parallel with investment in the digital platform, H&M started investing in bootstrapping AI capabilities in house with the help of an external consultancy. The payback time of the investments during this phase was actually phenomenal. With our scale, we were able to pay back the full investment within one year. And in 2018, we established the advanced analytics and AI function and started rolling over the pilots to full-time employees instead. In 2019, we started industrializing the use cases, and by that I mean that we were focusing on process mapping and aligning the architectures to a common reference architecture.
Most initiatives today are run through the AI Foundation. The AI Foundation is a team of 120 people skilled in building AI systems for production. And today, we have products in production all across the value chain in H&M, starting with quantifying the assortments, deciding on where to produce, and support for negotiating a reasonable and sustainable cost. How to allocate the assortment between warehouses and stores, deciding on an optimal selling price, and recommending the right item to the right customer at the right time.
We have been successful, and this is in part because of the way that we have been working with products throughout the life cycle. And in part, our efforts to reduce technical debt in all stages of development. And in part also, because we’ve had a constant stream of interesting use cases that immediately can scale globally.
Today, we run three tech enablement teams as well, focusing on extracting new best practices from our teams that can be codified and put into production for the benefit of both existing use cases and new products, to get them to market faster. These are the knowledge capture and best practices team and the AI platforms team. The third team is dedicated to bringing in new use cases, starting them up from scratch and then bringing them in as any other use case.
That was it about the history and the AI Foundation that we have today. Now, I want to switch over and talk more about how we want to support the whole group with a new initiative that we’re calling the Fountainhead, and it really builds on the journey that we’ve had so far.
The biggest challenge that we have faced is that of quickly scaling our AI initiatives to the adoption rate that we see internally. By 2025, all core operational decisions will be amplified by AI throughout the H&M group. We’re asked to quickly scale recommendation engines, computer vision, natural language processing, and much more in just a matter of months. Naturally, this requires both organization, but also a skill set that is really, really hard to recruit for.
The process of defining the Fountainhead initiative was something that required a lot of effort. We’ve had five workshops, we’ve surveyed 33 products internally, and we tried to leverage several different means of getting an outside-in perspective on our organization and our best practices. This has given us the insights that we need to radically change how we work, and how we want to grow use cases, and how we want to leverage people in the organization.
In order to unlock more value and reduce time to market, we need to focus more effort on both our people and our processes. And for people, it really starts with knowledge: finding the right people, putting them on the right problems, and giving them the support they need to succeed. This really means three big initiatives: the AI Literacy Project that we have, bootcamps for bringing people up to speed with the vast responsibilities that they have, as well as supporting a learning culture where we constantly learn and improve.
But also targeted recruitments, both to gain speed in the development of use cases, but also to gain insights from leaders in the field and support our learning culture by spreading that knowledge in our organization. And then we also need to match skills with desires. We need to limit the time people work on things that they do not love. As we know that people will always perform better when they do things that they love.
For processes, it’s really about creating the organizational structure that allows our people to thrive. Bringing the AI Academy to support the learning journey for everyone in the group, not just the AI Foundation, but everyone needs to understand AI. And building the tool sets and AI platforms to support separation of duty and matching skill sets with desires, that’s really what we want to achieve.
This is the entire structure of the Fountainhead with different layers. And it is really an umbrella under which we capture all of our initiatives to build an enterprise wide AI platform, including people and processes, culture and tech. With our goals of supporting all of the brands in the group and all of the product teams on the same mission, we’re launching capabilities to support a system wide approach to flowing AI use cases from idea to MVP and into production. Also, a separation of duties between core enablement teams and product teams.
Some of these initiatives, specifically for flowing AI use cases, include AI Literacy, educating everyone on the basic principles of AI, but also raising the bar through a thriving learning culture. We want to provide POC and dev support throughout the organization, infrastructure as code and common standards, model development frameworks, rapid exploration of reusable output throughout the organization, and documentation and code work.
And for the separation of duties, we really want to have production and deployment infrastructure in place. We want to have product life cycle management and industrialization toolboxes. We want to have a control tower and measurements infrastructure available for these platforms, and visualization out of the box and model life cycle management. Everything out of the box.
The previous slides have mostly been about history and theory. Let’s take a look at a more hands-on example of this. Content personalization is one of the AI platforms that we’re currently building, and we’re building it in customer AI to support the customer domain, so everything that’s customer facing.
But let’s first take a step back. In this slide you can see our current reference architecture. I’m not going to go into the specific details of it. We’ve worked on reference architectures for a while and been talking a bit about them externally, and they have served us really, really well. Use cases are allowed to deviate from these reference architectures, but they need to justify their decision. The flexible approach of these architectures allows existing use cases to slowly adapt to the reference over time, but it also allows new use cases to experiment with new systems that provide learning opportunities for us, which guide the further development of the reference architecture.
But for this presentation, it’s actually the middle layer here, the abstraction, that is the most important part. Essentially, we treat models as a message. A message with a very specific interface. And this message can then be transported on a message queue. The abstraction of these models as messages just means that different teams can do different parts of the process independently of each other. And by extension, we can create platforms that allow model management and deployment to be maintained by central teams. This has a huge impact on our ability to match skills with passion, because essentially, all that a use case needs to focus on, if they so choose, is the actual implementation.
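To make the idea of a standardized model interface concrete, here is a minimal sketch of what such a contract could look like. The class and function names (ModelMessage, TeaserRanker, to_queue_payload) are illustrative assumptions, not H&M’s actual interface:

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ModelMessage(ABC):
    """Illustrative contract: every model packaged for the platform exposes
    the same predict signature, so the serving infrastructure can treat it
    as an opaque message travelling on a queue."""

    @abstractmethod
    def predict(self, features: List[float], feature_names: List[str]) -> Dict[str, Any]:
        ...


class TeaserRanker(ModelMessage):
    """Hypothetical use-case model; only this class is use-case specific,
    everything around it is shared platform code."""

    def predict(self, features, feature_names):
        # Real model logic would live here.
        return {"recommendations": ["teaser_a", "teaser_b"], "probabilities": [0.6, 0.4]}


def to_queue_payload(model_name: str, request: Dict[str, Any]) -> str:
    """Serialize a scoring request so it can be transported on a message queue."""
    return json.dumps({"model": model_name, "request": request})
```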
Now, to remind us of how big an impact this can have, we just need to go back to the now famous Google paper on technical debt in machine learning systems. And this little box here is the ML code. If we standardize the interface between models and the serving infrastructure, and build all the other systems around it to support the serving infrastructure and the ML code where applicable, we have drastically reduced the amount of code that we need to develop in order to scale a new use case. The AI platforms really aim to enable this process and automate the flow of use cases. The content personalization engine, specifically, aims to serve models in real time on our website, or in the future in app and in store as well.
For the model serving, we’re building the platform around the Seldon platform. And one reason why we like Seldon is that it allows us a lot of flexibility to define release strategy and experimentation independently, which is a must if we are to develop an AI platform that can serve multiple use cases with unknown needs.
So for instance, we can choose to release a new model either just by switching, or using a shadow implementation or a canary implementation. And we can do that independently of whether we also want to run an A/B test at the same time, or run a multi-armed bandit strategy, something that we’ll go into in more detail later on.
The way that this is really done is by having what they call an inference graph. So in this case, we get a message from the user, do an input transformation, and then a router sends it out to one of three different models. In this case, we have two backup strategies for the multi-armed bandit strategy, and then we can have an A/B test.
And this could be, for instance, a case where we want to run the multi-armed bandit model as the primary model, but then test if changing the images has an impact, potentially as a function of the time of day.
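As a rough illustration of how a router inside such an inference graph could pick between child models, here is a toy epsilon-greedy router loosely following Seldon Core’s Python component convention (a route method to pick a child, a send_feedback method to learn from rewards). The branch count, the epsilon value, and the epsilon-greedy policy itself are simplifying assumptions, not the production strategy:

```python
import random

import numpy as np


class EpsilonGreedyRouter:
    """Toy router: explores a random branch with probability epsilon,
    otherwise exploits the branch with the best observed mean reward."""

    def __init__(self, n_branches: int = 3, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = np.zeros(n_branches)
        self.rewards = np.zeros(n_branches)

    def route(self, features, feature_names):
        # Called at predict time; returns the index of the child model to use.
        if random.random() < self.epsilon or self.counts.sum() == 0:
            return random.randrange(len(self.counts))
        return int(np.argmax(self.rewards / np.maximum(self.counts, 1)))

    def send_feedback(self, features, feature_names, reward, truth, routing=None):
        # Called later, when the reward (e.g. a click) arrives for a routed request.
        if routing is not None:
            self.counts[routing] += 1
            self.rewards[routing] += reward
```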
This is the content personalization platform that we have. The architecture. And the use case is really that a user comes to our website and we want to render some kind of webpage for them based on who they are and other context. We get information in two stages. First, the predict stage, where we decide what to show to the customer. And later we get the reward from the customer based on their interaction with the webpage. And in a final stage, we need to combine these two data sets in order to have the model data that we can use for retraining the model.
In the predict stage, the request comes in from the customer via an API gateway, and the request is enriched and fed into Seldon running in AKS via Istio. Seldon processes this inference graph, where we can define exactly what kind of model we want to use, and whether we want to use A/B testing or something else. And then we output the results, both to the customer, but also send them to an Azure Event Hub. And we capture the raw output from the Event Hub into the Azure data lake, as well as process it with a Stream Analytics job or an Azure Function and store it in an SQL database for later processing.
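As a hedged sketch of how the raw prediction output could be published to the Event Hub from Python, using the azure-eventhub SDK; the connection string and hub name are placeholders, and the payload shape is assumed:

```python
import json

from azure.eventhub import EventData, EventHubProducerClient

# Placeholders: real values would come from configuration or a secret store.
CONNECTION_STR = "<event-hub-connection-string>"
EVENTHUB_NAME = "predictions"


def publish_prediction(prediction: dict) -> None:
    """Send one raw model output to the Event Hub, from which it is both
    archived to the data lake and processed into the SQL database."""
    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(prediction)))
        producer.send_batch(batch)
```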
In the rewards stage, the customer has done something on the webpage, potentially clicked one of the banners or something like that, that we’ve presented to them. And we process that with an Azure Function or something like that, to get a reward. We then put that on the Event Hub and save both the raw data into the Azure data lake, but also pick it up and push it to the SQL database.
In the SQL database we have two jobs to run: one periodic join, joining the reward with the prediction, and then a periodic purge, just to keep the database small. From there, we push the data into the Azure data lake, and it’s picked up by an Azure pipeline for model training, registration, test, and deployment to prediction. This will typically be running as a batch overnight currently. The pipeline will also trigger the build of a [inaudible] image that’s registered in the Azure container registry, ready to be approved and deployed.
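One possible shape of the periodic join and purge, sketched as SQL statements run from Python via pyodbc. All table and column names, the one-hour join delay and the seven-day retention are invented for illustration, not taken from the talk:

```python
import pyodbc

# Hypothetical table/column names for illustration only.
JOIN_SQL = """
INSERT INTO training_data (request_id, context, action, probability, reward)
SELECT p.request_id, p.context, p.action, p.probability, COALESCE(r.reward, 0)
FROM predictions p
LEFT JOIN rewards r ON r.request_id = p.request_id
WHERE p.created_at < DATEADD(hour, -1, SYSUTCDATETIME());
"""

PURGE_SQL = """
DELETE FROM predictions WHERE created_at < DATEADD(day, -7, SYSUTCDATETIME());
DELETE FROM rewards WHERE created_at < DATEADD(day, -7, SYSUTCDATETIME());
"""


def run_periodic_jobs(connection_string: str) -> None:
    """Join rewards onto predictions into a training table, then purge old
    rows to keep the operational database small."""
    with pyodbc.connect(connection_string) as conn:
        cursor = conn.cursor()
        cursor.execute(JOIN_SQL)
        for statement in PURGE_SQL.strip().split(";"):
            if statement.strip():
                cursor.execute(statement)
        conn.commit()
```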
So this is one of the use cases that we’re currently building on the content personalization platform, and it’s about changing teasers on the start page. So currently, the start page is built up by containers that are placed on the start page in order from top to bottom. And there are different types of containers, for instance teaser containers, segment teaser containers, banner containers, et cetera. And every container can be linked with content, and that is the actual teaser or a banner or something like that.
And for this use case, we have created a new container type called the slot container, which is basically just a placeholder that can be placed on the start page where we want it to be. And when the customer comes to the start page, the front end is responsible for sending the request to the prediction API with the required data to be processed by the model.
And if there is a response, the slot containers will be loaded with the teasers that the API suggests. And if there is no response or an empty response is fed back, then the front end will simply load the slot containers with the regular teasers in the default order.
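A minimal sketch of that fallback decision, with illustrative field names (recommendations, teaser_id) that may differ from the real API:

```python
from typing import List, Optional


def fill_slot_containers(api_response: Optional[dict], default_teasers: List[str]) -> List[str]:
    """Decide what to render in the slot containers: use the model's ranked
    teasers when the prediction API answers, otherwise fall back to the
    default teasers in their regular order."""
    if not api_response or not api_response.get("recommendations"):
        return default_teasers
    return [rec["teaser_id"] for rec in api_response["recommendations"]]
```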
So, in this case, we are actually using contextual bandits. I’ll briefly describe what they are. So essentially, it’s a type of reinforcement learning strategy where you have an environment and an agent, and the agent interacts with the environment based on a state, performs actions, and then gets rewards in return for this.
Now, in the general case in reinforcement learning, this loop continues for a while, but we’re using a class of multi-armed bandits, where this process is just run once. So the agent is presented with an environment and learns about that environment through the state, takes one action and then gets a reward. And that’s really the setup when a customer comes to a webpage. The user, or the customer, is the environment that our AI model, the agent, sees. The user presents their state, who they are, the agent reacts to that, takes an action, we present some banner, and then we get a reward based on that.
So the general case is the multi-armed bandit. But in general, those are harder to learn because you don’t learn too much across customers. So instead, you can use contextual bandits to decrease the solution space of the learning problem. And that’s what we’re using today, contextual bandits.
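To make the single-step loop concrete, here is a minimal linear contextual bandit with epsilon-greedy action selection. It is a generic sketch under assumed dimensions and hyperparameters, not the algorithm H&M runs in production:

```python
import numpy as np


class LinearContextualBandit:
    """Toy contextual bandit: one linear reward estimate per action,
    epsilon-greedy selection, single-step interaction
    (observe context -> choose action -> receive reward)."""

    def __init__(self, n_actions: int, n_features: int, epsilon: float = 0.1, lr: float = 0.05):
        self.weights = np.zeros((n_actions, n_features))
        self.epsilon = epsilon
        self.lr = lr

    def choose(self, context: np.ndarray) -> int:
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(len(self.weights)))   # explore
        return int(np.argmax(self.weights @ context))           # exploit

    def update(self, context: np.ndarray, action: int, reward: float) -> None:
        # One gradient step on the squared error between predicted and observed reward.
        error = reward - self.weights[action] @ context
        self.weights[action] += self.lr * error * context


# Example: 3 teasers to choose between, 5-dimensional user context.
bandit = LinearContextualBandit(n_actions=3, n_features=5)
context = np.array([1.0, 0.0, 1.0, 0.3, 0.7])   # e.g. device flags, segment scores
action = bandit.choose(context)
bandit.update(context, action, reward=1.0)        # the customer clicked the teaser
```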
This is an example of the API that we’re using. So the recommendation request API has information about the session, for instance the customer ID, user features such as the device and customer segments, and the action features, so the different teasers that the algorithm can choose between. And in this case, we actually also have different sizes for the banners, and each of those becomes a slot in the model. So we’re actually using slate-based or slot-based contextual bandits in this case.
And here, based on the input that we saw on the previous slide, is the action that our agent has chosen. We can do A/B testing in this case, and depending on whether the test group is in treatment or control, the return API will be different.
So here are our recommendations, in the order of the slots on the webpage. And if it’s the control, the B sample, we don’t send back anything, essentially. It’s an empty prediction. Note also the probabilities that we send back; these are only used for learning the optimal strategy and are not presented on the front end.
And the last stage of the contextual bandits is really to learn about the reward that we get from the environment. And in this case, we can see that the user has clicked on this specific teaser in slot one.
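Putting the three stages together, the payloads could look roughly like the following. The field names are a simplified approximation based on the description above, not the exact API schema:

```python
# Illustrative payloads only; real field names and values may differ.
recommendation_request = {
    "session": {"customer_id": "abc-123"},
    "user_features": {"device": "mobile", "segments": ["womenswear", "member"]},
    "action_features": [              # the teasers the algorithm can choose between
        {"teaser_id": "t1", "size": "large"},
        {"teaser_id": "t2", "size": "small"},
        {"teaser_id": "t3", "size": "small"},
    ],
}

recommendation_response = {           # treatment group: ranked teasers per slot
    "recommendations": [
        {"slot": 1, "teaser_id": "t2", "probability": 0.55},
        {"slot": 2, "teaser_id": "t1", "probability": 0.30},
    ],
}
# Control group: an empty response, and the front end falls back to the default teasers.

reward_event = {
    "session": {"customer_id": "abc-123"},
    "reward": {"slot": 1, "teaser_id": "t2", "clicked": True},
}
```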
So takeaways. Scaling AI requires autonomy and alignment, and that is really what we’ve been doing throughout our journey, both with regards to our reference architectures, but also with now the Fountainhead initiative. We want to allow people to have sufficient autonomy to run as quickly as possible. But still, we need to have alignment in order to create synergies for the organization as a whole, and also develop in the way that we want to. The balance is tricky, but it is crucial. And it reminds me of an African proverb that I think is especially fitting. And the proverb is that if you want to go fast, go alone. But if you want to go far, then go together. So if you want to go fast, you can allow a lot of autonomy, but if you want to go far, then you need to have some alignment.
The structure that you want to have in your alignment is really allowing the structure to crystallize outwards from the team. Excessive alignment just stifles growth, and therefore, you want to have the minimum alignment that you can allow in order to get to the best results. And therefore, always go internally. Talk with the teams, understand what the requirements are, and don’t look too much to the outside world. Get inspiration, but listen to your internal teams.
And something that’s really been working for us very well is by focusing on unlocking flow by standardizing and simplifying. In order to grow our AI initiatives across the group, we need to be thinking about how we empower people to create the flow by standardizing the tool sets and simplifying the work that they need to do.
And that’s it for me.

Bjorn Hertzberg

Björn has 20 years of professional experience in data & analytics, mostly from financial markets, where he has been working in trading, asset management and risk management. In the late 1990's...