Extracting text of various sizes, shapes, and orientations from images containing multiple objects is an important problem in many contexts, especially in e-commerce, augmented-reality assistance systems in natural scenes, content moderation on social media platforms, etc. Text extracted from an image can be a richer and more accurate source of data than human input, and can be used in several applications like Attribute Extraction, Offensive Text Classification, Product Matching, Compliance use cases, etc. Extracting text is achieved in 2 stages. Text detection: the detector detects the character locations in an image and then combines characters close to each other into words based on an affinity score, which is also predicted by the network. Since the model works at the character level, it can detect text in any orientation. The detected text is then sent through the Recognizer module. Text recognition: detected text regions are sent to the CRNN-CTC network to obtain the final text. CNNs are used to obtain image features that are then passed to the LSTM network, as shown in the figure below. A Connectionist Temporal Classification (CTC) decoding operation is then applied to the LSTM outputs for all the time steps to finally obtain the raw text from the image.
– Hi, so I’m Rajesh Bhat, I’m working as a Data Scientist at Walmart Labs, Bengaluru. I’m also pursuing my master’s from Arizona State University, and I’m a Kaggle Competitions Expert with three silver medals and two bronze medals. So in today’s session I’ll be talking about Text Extraction from Product Images.
So this is how the Agenda looks. To start with, I’ll give an introduction to the Text Extraction task. Then I’ll be talking in detail about the Text Detection and Text Recognition models, which are the building blocks for Text Extraction. Later I’ll touch upon the CTC loss calculation part and talk about other advanced techniques for Text Extraction, and finally I’ll be happy to take questions.
Yeah, so this is how the Text Extraction Pipeline looks. So given a product image, first of all we need to know where exactly the text is present in the image, right? That is called Text Detection. Once we know where the text is present in the image, we crop those regions and send them to the Text Recognition model, okay? The Text Recognition model takes the cropped images as input, and finally we get the raw text as output, okay? In the example below you can see exactly the Text Detection and Text Recognition steps: you can see the bounding boxes after the Text Detection task, those are fed to the Text Recognition step, and finally we get the raw text out of it, right? There are many domains where you can plug in this Text Extraction pipeline. So let’s say in retail, right?
Let’s say we take the domain of retail or eCommerce, right? So we have the product catalog, and it often happens that the product catalog is not fully clean, or some values are missing in the product catalog. If we have the product images for that, then we can extract the text from them, and after extracting the text we can do attribute extraction or entity recognition and fill in those missing values in the product catalog. Or let’s say there is offensive content in the product image; then we don’t want to show those kinds of images on the eCommerce website, right? In that scenario as well, we can do the Text Extraction first and then do a classification to say whether there is offensive content or not. Facebook also has similar functionality: whenever a user uploads an image, they check whether there is offensive content in the uploaded image or not. So it’s basically a social media content monitoring system.
There are a lot of other use cases as well, but these are a few that come to my mind right now, yeah. So next is talking in detail about the Text Detection technique. I’ll be talking about a character-level detection technique, which is achieved through Image Segmentation, okay?
Yeah, so I’ll just give a brief overview of what Image Segmentation is. Basically, given an image, we want to find the different segments that are present in that digital image. Let’s take this cat as an example, and say there are two segments: the cat and the background, right? So given the input image, the Ground Truth looks something like what you see on the right: all the pixels belonging to the cat are marked as ones and the background pixels are marked as zeros, not-cat, right? That’s a binary mask, so it’s basically called a mask, and this is how the Ground Truth looks. Given a new cat image, we want the output to look something like this. The same thing applies in this example with the giraffes: you can see the binary mask, where the pixels belonging to the object are set and the others are treated as background. The same thing applies to the Text Detection scenario, right? Wherever the characters are present, those pixels are marked, and the rest of the region is not. This is how the Ground Truth looks for the Text Detection task, okay? But how do we generate this Ground Truth? Because training data generation for this kind of task is very expensive, right? So now we’ll see how we can generate the training data for the Text Detection task. Okay, so we have Character Boxes over here, right? Given a word, for each character we have a box, so character-level annotation is present. Using this information, everything else shown on the slide needs to be generated; basically I’m talking about the Affinity Boxes and then finally getting the masks out of them. So Affinity Boxes are mainly for telling that two consecutive characters belong to the same word, right?
So let’s take an example, let’s take B and D. If B and D are consecutive characters of the same word, there is an Affinity Box between them. But if you take the last character of one word and the first character of the next word, there is no Affinity Box between those two, right? That means those two are coming from two different words. Okay, so Affinity Boxes are only for consecutive characters that are part of the same word, okay. Now let’s see how we generate this Affinity Box. So given the Character Boxes of two characters, say B and D, what we have to do is join the diagonals of each Character Box; whatever you are seeing dotted there, those are the diagonals. Once we join the diagonals for each of these characters, we get four triangles per box. The next step is finding the centroids of the upper and lower triangles of each box, and then joining those centroids. Once we join the centroids, we get the Affinity Box for the two characters. The same thing applies for every other pair of consecutive characters, and finally we get the Affinity Boxes between every two characters. Now, once we have the Affinity Boxes and Character Boxes, the next step is to create the masks. In the cat segmentation example it was a binary mask, but in this case it’s a 2D isotropic Gaussian, so the values are continuous now. What we do is take the 2D isotropic Gaussian and warp it to fit each particular box, whether a Character Box or an Affinity Box, and you can see the transformed Gaussian over here. That transformed Gaussian is the mask for each character, and the same thing applies for the Affinity Boxes. Finally, we get the Region Score Ground Truth and Affinity Score Ground Truth, right? These are Gaussian maps, so they have continuous values now.
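The centroid construction just described can be sketched in a few lines of NumPy. This is a minimal illustration of the geometry, not the CRAFT reference implementation; the corner ordering (top-left, top-right, bottom-right, bottom-left) and the use of the corner mean as the diagonal-intersection point are simplifying assumptions.

```python
import numpy as np

def triangle_centroids(box):
    """box: 4x2 array of corners in order (tl, tr, br, bl).
    Joining the two diagonals splits the box into four triangles;
    return the centroids of the top and bottom triangles."""
    box = np.asarray(box, dtype=float)
    # For simplicity, use the corner mean as the diagonal-intersection point.
    center = box.mean(axis=0)
    top = np.array([box[0], box[1], center]).mean(axis=0)     # tl, tr, center
    bottom = np.array([box[2], box[3], center]).mean(axis=0)  # br, bl, center
    return top, bottom

def affinity_box(box1, box2):
    """Affinity box between two consecutive character boxes:
    the quadrilateral joining the top and bottom triangle centroids
    of both boxes."""
    t1, b1 = triangle_centroids(box1)
    t2, b2 = triangle_centroids(box2)
    # Corners in the same (tl, tr, br, bl) convention.
    return np.array([t1, t2, b2, b1])
```

For two adjacent unit-square character boxes, this yields a box spanning between their centers, narrower than the characters themselves, which is exactly the "link" region the Affinity Score is trained to light up.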
So if it was a binary mask, as in the cat segmentation example, we could have used binary cross-entropy as the loss function for training the model. Since these are continuous values, we can’t use cross-entropy; instead a regression loss such as pixel-wise mean squared error is used.
Now let’s look into the model architecture for Text Detection.
So as you can see on the right, it’s very similar to the U-Net architecture. The U-Net architecture is pretty famous in Image Segmentation tasks, basically in medical image segmentation, those kinds of things.
So, basically we have a batch-normalized VGG16 as the backbone over here, then skip connections very similar to the U-Net architecture, and then upsampling blocks over here. So given an input image, the outputs are the Region score and the Affinity score.
And this is actually a published paper by Clova AI Research, called CRAFT: Character Region Awareness for Text Detection. There are several other techniques for Text Detection as well; word-level text detectors are available, but the problem with those is that if the word is curved or has an arbitrary shape, the detection will not be that accurate. If we are doing it at the character level, then, as we’ll see in the next slides, the detection is very accurate.
So these are sample output images taken from the paper itself. Given the input image at the top (you can ignore the annotations drawn on it), the outputs from the Text Detection model are the Region score and the Affinity score. The Affinity score tells how two characters are related to each other: are they part of the same word or not? As you can see, there are two words here, but there is no affinity region between the two words.
Also, since the segmentation, I mean, the prediction, is happening at the character level, we can detect arbitrarily shaped, curved text as well.
To get the bounding boxes, we can use functionality from OpenCV: basically we have the connectedComponents and minAreaRect functions in OpenCV, and we can leverage those to find the bounding boxes. Now we’ll see how the output looks for product images. Given a sample product image, you can see the bounding boxes that finally come out of the Text Detection model. So this was about Text Detection; now we’ll see how we can leverage the output of Text Detection and perform Text Recognition.
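As a rough illustration of what those OpenCV calls do, here is a library-free sketch that thresholds a score map, labels 4-connected components with a flood fill, and returns one axis-aligned box per component. The 0.5 threshold is an assumption; the real pipeline would use cv2.connectedComponents plus cv2.minAreaRect, which also gives rotated boxes.

```python
import numpy as np
from collections import deque

def boxes_from_score_map(score, thresh=0.5):
    """Threshold the region score map, label connected components with a
    BFS flood fill, and return one (x0, y0, x1, y1) box per component."""
    mask = np.asarray(score) > thresh
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            # BFS over the 4-connected component starting at (sx, sy).
            q = deque([(sy, sx)])
            seen[sy, sx] = True
            ys, xs = [sy], [sx]
            while q:
                y, x = q.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        ys.append(ny)
                        xs.append(nx)
                        q.append((ny, nx))
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Each returned box would then be cropped out of the original image and sent to the Text Recognition model.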
First, let’s talk about how we can prepare data for the Text Recognition task. I am using a library called SynthText for synthetically creating the dataset.
For the training dataset for Text Recognition, I can’t manually annotate all those cropped images. Training a deep learning model that way would be very difficult, because deep learning needs lots of data, and preparing that data manually is a very difficult task. That’s why I’m creating the dataset synthetically.
It depends on the use case; for me, it was extracting text from product images. So what I did was take the product descriptions and product titles from the catalog and synthetically create around 15 million images. Basically, a lot of variation was added to these 15 million images: different font styles, different font sizes, different font colors, varying backgrounds, and all those things. Now let’s say you had to do number plate recognition, right? That’s a totally different scenario: the kind of vocabulary, the kind of words that go into that domain, is totally different. Certain characters are followed by certain numbers, and you don’t see meaningful words in it. So depending on the use case, you have to choose the vocabulary and synthetically create the dataset. For the product scenario I had 92 characters in the vocabulary: basically capital letters, small letters, numbers, and special symbols, because a product can have an expiry date or pricing information printed on it. So the vocabulary comes to 92 characters. On the right, you can see the synthetically generated images using the SynthText library; you can see a lot of variation in the generated data itself.
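The talk doesn’t list the 92 characters explicitly, so the particular symbol subset below is an assumption, chosen only so the total comes to 92. The encode/decode helpers show how such a vocabulary maps ground-truth strings to the integer labels a CTC-trained recognizer expects.

```python
import string

# 26 capitals + 26 smalls + 10 digits = 62 characters; the remaining 30
# special symbols below are an assumed subset (the talk does not list them),
# picked so the vocabulary comes to 92 characters as stated in the talk.
symbols = [c for c in string.punctuation if c not in "`~"]  # 30 symbols
vocab = (list(string.ascii_uppercase) + list(string.ascii_lowercase)
         + list(string.digits) + symbols)

char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

def encode(text):
    """Map a ground-truth string to the integer labels CTC training expects."""
    return [char_to_idx[c] for c in text]

def decode_labels(labels):
    """Inverse mapping, integer labels back to text."""
    return "".join(idx_to_char[i] for i in labels)
```

Note the CTC blank is deliberately not in this vocabulary; as discussed later, it is a special extra symbol appended on top of these 92 characters.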
Now we’ll see how the pipeline looks for Text Recognition.
So given the output from Text Detection, we crop those parts and send them to the CNN model so that we get the image features out of it. These image features then act as input to the LSTM model.
Now, whatever output we get from the LSTM model is passed to the CTC Decoder, and from that we get the final output of the Text Recognition model. Next, we’ll briefly understand what receptive fields are, and then I’ll talk about how the receptive field concept is connected to the Text Recognition model.
Typically, what happens in convolutional networks is that a bunch of filters is applied on the input. Let’s say in this example I have a 5×5 image, and let’s say I have a 3×3 filter. If I apply this 3×3 filter on a patch of the image, I get a single value, right? So this single value has visibility on that 3×3 patch. This is nothing but the receptive field: it’s basically the region in the input image space that a particular CNN feature is looking at, right? So this CNN feature in the feature map is looking at a 3×3 patch. Now, once we apply this filter across the whole image, the output will be 3×3, right? So that is the size of the feature map now. Let’s say we apply one more 3×3 filter on top of this 3×3 feature map. Once we do that, we get a single value; that is the final feature map.
Now this single value has visibility on the entire 5×5 input.
So the receptive field for this value is 5×5. Basically, if you are aware of single-shot detectors in the object detection task, they take features not only from the final layers but also from the intermediate layers. The intuition is that intermediate layers have low receptive fields, and the layers which are close to the task head, say a classification or object detection head, have a higher receptive field. In the object detection scenario, we want to detect the object irrespective of its size. If the object is very small, the intermediate layers, whose receptive fields are small, will be able to handle and detect the small objects, and the features close to the final layer, which have higher receptive fields, will be able to detect the larger objects. That’s why we take intermediate-layer features as well as final-layer features in the object detection task. Now, we’ll see how that’s related to the Text Recognition task. I have an image where 128 is the width and 32 is the height, and it’s a grayscale image. "None" is nothing but the batch size, so you can ignore that, okay? So given the input image, in the CNN model I have a couple of convolutional layers and max-pooling layers, so that I get a feature map of shape 1×31×512.
So something like this, right? We have 1 row, 31 columns, and 512 channels; this is the shape of the feature map now. And now we need the receptive field concept, okay? A particular value in the feature map has visibility on a vertical slice of the input image. So if you take the 31 timesteps, the final values correspond to the final part of the image and the initial values correspond to the first part of the image, right? So it’s sequential in nature. To give you an intuition in an NLP context, 31 here is nothing but the number of timesteps, kind of like the max sequence length we define when training a typical LSTM model, and 512 can be thought of as the word embedding dimension. It’s as if we are feeding one word per timestep, 31 words in total, and the embedding size for each word is 512. That is the intuition, if you want to relate this to the NLP context. This feature map then acts as input to the LSTM model. And finally, since we have 31 timesteps, we will be getting softmax probabilities over the vocabulary for each of the 31 timesteps, where the vocabulary size is 92. Now here comes the interesting part: given an input image, "hello", we have the Ground Truth text, "hello".
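The shape bookkeeping above can be made concrete with a small NumPy sketch. The conv stack itself is elided and replaced by a random tensor of the stated shape, and the random projection standing in for the trained LSTM-plus-dense head is purely illustrative.

```python
import numpy as np

# A 32x128 grayscale crop comes out of the conv + max-pool stack as a
# feature map of shape (batch, 1, 31, 512): height collapsed to 1,
# 31 horizontal positions, 512 channels. We fake that output here.
batch = 4
feature_map = np.random.rand(batch, 1, 31, 512).astype(np.float32)

# Squeeze the height axis: each of the 31 columns becomes one "word" of an
# NLP-style sequence, with 512 playing the role of the embedding dimension.
sequence = feature_map.squeeze(axis=1)  # shape: (batch, 31, 512)

# Per timestep, the LSTM + dense head emits a softmax over the 92-character
# vocabulary plus the CTC blank -> (batch, 31, 93). A random projection
# stands in for the trained weights; softmax is computed stably.
vocab_plus_blank = 93
proj = np.random.rand(512, vocab_plus_blank).astype(np.float32)
logits = sequence @ proj
logits -= logits.max(axis=-1, keepdims=True)
probs = np.exp(logits)
probs /= probs.sum(axis=-1, keepdims=True)
```

So each of the 31 timesteps carries a full probability distribution over the vocabulary, which is exactly what the CTC decoder and loss consume next.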
That’s five characters, but the output from the LSTM model is for 31 timesteps.
So the length of the Ground Truth does not match the prediction: the Ground Truth length is five, and the prediction length is 31.
How do we calculate the loss, since the lengths are not matching? If it was a named entity recognition kind of task, then for every token in the input we would have a tag: is it an organization, does it refer to a person or a location, or is it something else? For every timestep we would have the Ground Truth in that kind of task, and in that scenario we can use categorical cross-entropy, since the Ground Truth and prediction lengths match. But for the Text Recognition task, in the CRNN model, we can’t, because the prediction and Ground Truth lengths are not matching.
The same thing holds for the speech-to-text kind of scenario, where we have a speech signal and just the corresponding text; we don’t have information like "the letter H was spoken from two seconds to three seconds." So can we manually align each character to its location in the audio, or in the image? Yes, we could, but that would take a lot of manual effort; we could just forget about training the deep learning model, because we need a lot of data and we’d spend most of the time preparing it.
So we have CTC loss to the rescue, the CTC operation to the rescue. CTC is nothing but Connectionist Temporal Classification. It’s a direct mapping from image to text, without worrying about the alignment of each character to its location in the input image. With it we can calculate the loss and we can train the network. So we’ll see what the decode operations are now, right?
CTC has basically two components, the CTC Decoder phase and the CTC loss; these two are different things.
As you saw previously, we have predictions for 31 timesteps, and the Ground Truth is only five characters, depending on what the image contains. So somewhere we have to merge the output that is coming from the model. One way I can think of is to reduce the repetitions: let’s say we had three h’s in a row, we bring them down to a single h, the same for e, and so on. But here the problem is we need a double l, right? If we just merge the output that is coming from the LSTM model, then whenever a character is genuinely repeated we’ll lose one of them: in "hello" we need two l’s, but merging gives us a single l. So what we do is introduce a special character called the blank character. This character should not be part of the original vocabulary; it’s a character totally outside the vocabulary, and it then gets added to the vocabulary. This is typically called the blank character, and we train so that whenever there is a repeated character, the model predicts a blank between the repeats. So instead of just merging the repeats directly, we first have the blank characters in place, and then we merge the repeats. After merging the repeats, you can see that since the blank character was introduced, we still have two separate l’s. Once we have merged the repeats, the next step is to remove the blanks. Finally we get "hello", which is the same as the Ground Truth, so we can say that our model is performing well. This was the CTC decoding step.
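The decoding rule just described, merge repeated characters first, then drop the blanks, fits in a few lines. Using "-" to display the blank is just a convention here.

```python
def ctc_decode(labels, blank="-"):
    """Greedy (best-path) CTC decoding as described above:
    first merge runs of repeated characters, then drop the blanks."""
    merged = []
    prev = None
    for ch in labels:
        if ch != prev:  # merge repeats: keep only the first of each run
            merged.append(ch)
        prev = ch
    return "".join(ch for ch in merged if ch != blank)  # drop blanks
```

For example, `ctc_decode("hhe-lll-llo")` gives "hello", while without the blank, `ctc_decode("hheelllloo")` collapses the double l and gives "helo", which is exactly why the blank character is needed.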
Now we’ll see how we can calculate the CTC loss.
Let’s not take a very large image, because it would be very difficult to show on the slides how we calculate the loss. So let’s take an image whose Ground Truth is AB, and let’s say the vocabulary is just A and B. Since we are using CTC as the loss function, we need to introduce the blank character, so my vocabulary size is three now. And let’s say there are only three timesteps; in the actual use case I had 31 timesteps, but to keep it simple, let’s say we just have three. Now we’ll see how we can calculate the loss.
So given the Ground Truth AB, I need to find all possible candidates on which, if I apply the CTC decode operation, that is, merge repeats and drop blank characters, I get a prediction which is the same as the Ground Truth.
To give you an example, the Ground Truth is AB. Let’s say my model is predicting ABB. If I apply the CTC decode operation on top of this, that is, merge repeats, then this double B will be merged to a single B; and when dropping blank characters, there is no blank character predicted here, so nothing is removed, and we end up with AB.
So that is an example of how we apply the CTC decode function to the predicted output, right? Basically, to calculate the loss given the Ground Truth, we generate all these candidates. For generating these candidates, the dynamic programming paradigm is used, subject to a condition: we should generate only those candidates which, when the CTC decode operation is applied on top of them, give us the Ground Truth.
Now let’s see: these are the softmax probabilities at t1, t2, t3. The distribution is over the vocabulary; to keep it simple we kept the vocabulary as A, B, and the blank character, introduced because we are using CTC, right? Now, what is the score, the probability, of observing one particular candidate?
How do we do that? We take the softmax probability of the candidate’s first character at timestep one, which is 0.4, multiplied by the softmax probability of its second character at timestep two, which is 0.7, multiplied by the probability of its third character at timestep three. Finally we get a score for that candidate.
So can we just multiply these softmax probabilities at t1, t2, t3? Yes, we can, because the predictions at each timestep are treated as conditionally independent given the input; that is the assumption CTC makes.
Now, we have to finally calculate the probability of getting the Ground Truth AB. This was just one scenario, but there are several scenarios that decode to AB: if the model predicts blank-A-B, that’s also fine, because applying the CTC decoder gives AB; A-blank-B and A-B-blank also give AB, and so do AAB and ABB. So all of these candidates count.
So how do we calculate the probability of getting the Ground Truth? We just add up the probabilities of all these candidates, by basic probability principles. After adding these probabilities, we end up with the probability of observing AB. Then finally the loss is minus the log of the probability of getting the Ground Truth.
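The whole calculation can be checked with a brute-force sketch: enumerate every possible 3-step path over the vocabulary, keep the ones that decode to the Ground Truth, multiply per-timestep probabilities, sum, and take the negative log. The uniform probabilities in the test are made up for illustration; a real implementation uses dynamic programming instead of enumeration.

```python
import math
from itertools import product

def ctc_collapse(path, blank="-"):
    """CTC decode: merge repeats, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

def ctc_loss(probs, vocab, target, blank="-"):
    """probs[t][c] is the softmax probability of vocab[c] at timestep t.
    Brute-force over all |V|^T paths (fine for this toy example only)."""
    total = 0.0
    T = len(probs)
    for path in product(range(len(vocab)), repeat=T):
        if ctc_collapse([vocab[i] for i in path], blank) == target:
            p = 1.0
            for t, i in enumerate(path):
                p *= probs[t][i]  # conditional independence across timesteps
            total += p
    return -math.log(total)
```

With vocabulary {A, B, blank} and uniform probabilities of 1/3 at each of the three timesteps, the five valid paths (A,B,-), (A,-,B), (-,A,B), (A,A,B), (A,B,B) sum to P = 5/27, so the loss is ln(27/5) ≈ 1.69.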
Probability values can’t be greater than one or less than zero, so the max value is one. Let’s say my model is doing really well; that means the probability of getting the Ground Truth will be very high, close to one, and minus log of one is zero, so the loss is zero, okay? Now let’s say my model is not doing well: at the start of the training process the model is still learning. In that case, the probability of getting the Ground Truth AB would be close to zero, and minus log of a probability tending to zero tends to infinity, so the loss tends to infinity whenever there is a complete mismatch and the model is not doing well.
And whenever we have a mismatch, you can see the probability of getting the Ground Truth tends to zero and the loss tends to infinity; you can take it offline and do a deep dive on this.
Now, this was how we mathematically calculate the loss, but in Keras we have an in-built function, so we need not worry about calculating the loss manually; we can just use the ctc_batch_cost function and we get the CTC loss. And this is how the model architecture looks: as I said earlier, we have a couple of convolutional layers, followed by max-pooling layers, and then this output acts as input to the LSTM.
So as I said earlier, I had around 15 million images. I can’t load all these images into memory; if I did, it would come to around 690GB. So we were using the generator functionality of Keras, which loads only the particular batch into memory, and with the help of workers, max_queue_size, and multiprocessing we can speed up the training process. Training was done using a P100 GPU, taking around two hours for a single epoch, and prediction time was about one second for a batch of 2,048 images.
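A minimal version of that generator idea is sketched below, with a placeholder load_fn standing in for the real image decoding; the actual training used Keras’s generator machinery with workers and max_queue_size rather than this hand-rolled loop.

```python
import numpy as np

def batch_generator(image_paths, labels, batch_size, load_fn):
    """Yield (images, labels) batches indefinitely, loading only one
    batch into memory at a time -- the same idea that lets ~15M images
    be trained on without ~690GB of RAM. `load_fn` maps a path to a
    decoded (32, 128) array; here it is a placeholder for real loading."""
    n = len(image_paths)
    order = np.random.permutation(n)
    start = 0
    while True:
        if start + batch_size > n:  # reshuffle at each epoch boundary
            order = np.random.permutation(n)
            start = 0
        idx = order[start:start + batch_size]
        start += batch_size
        batch_x = np.stack([load_fn(image_paths[i]) for i in idx])
        batch_y = [labels[i] for i in idx]
        yield batch_x, batch_y
```

Only batch_size decoded images ever live in memory at once, so the dataset size is bounded by disk, not RAM.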
So that was the training side of Text Recognition with the CRNN-CTC model. There are several other approaches one can try: for instance, an attention model in an encoder-decoder framework, where basically you can get rid of CTC and have a cross-entropy loss instead. And I’ve seen many people use spatial transformer networks just before plugging into the Text Recognition model: distorted images are rectified by the spatial transformer network, and the rectified image then acts as input to the model. So it’s like a CNN followed by an attention mechanism, or the CNN followed by the CRNN-CTC model. I have the references; maybe you can look into them offline.
So this is what I had. I know I have covered a lot of content; the aim was to provide an overview of different techniques that one could apply for Text Extraction tasks. The content of this session is presented in this deck; you can refer to the URL or just scan this QR code.
So if you have any questions, I can take it now.
Walmart Global Tech India
Rajesh Shreedhar Bhat is working as a Senior Data Scientist at Walmart Labs, Bangalore. He completed his Bachelor’s degree from PESIT, Bangalore, and is currently pursuing a Master’s in CS with an ML specialization from Arizona State University.
He has a couple of research publications in the field of NLP and vision, published at top-tier conferences such as ECML-PKDD, CoNLL, ASONAM, etc. He is a Kaggle Expert (World Rank 966/122431) with 3 silver and 2 bronze medals and has been a regular speaker at Kaggle Days meetups.
Apart from this, Rajesh has been a mentor for the Udacity Deep Learning & Data Scientist Nanodegrees for the past two and a half years and has conducted ML & DL workshops at GE Healthcare, IIIT Kancheepuram, and many other places.