Extracting text of various sizes, shapes, and orientations from images containing multiple objects is an important problem in many contexts, especially in e-commerce, augmented-reality assistance systems for natural scenes, content moderation on social media platforms, etc. Text extracted from an image can be a richer and more accurate source of data than human input, and it can be used in several applications such as attribute extraction, offensive-text classification, product matching, compliance use cases, etc. Text extraction is performed in two stages.

Text detection: the detector locates individual characters in the image and then groups nearby characters into words based on an affinity score, which is also predicted by the network. Because the model operates at the character level, it can detect text in any orientation. The detected text regions are then passed to the recognizer module.

Text recognition: the detected text regions are fed to a CRNN-CTC network to obtain the final text. A CNN extracts image features, which are then passed to an LSTM network, as shown in the figure below. A Connectionist Temporal Classification (CTC) decoding operation is then applied to the LSTM outputs across all time steps to obtain the raw text from the image.
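The final CTC decoding step described above can be illustrated with a minimal greedy-decoding sketch: take the most likely label at each time step, collapse consecutive repeats, and drop blanks. The character set and scores here are hypothetical, assumed only for illustration; a real recognizer would use the vocabulary and per-step probabilities produced by its trained LSTM.

```python
import numpy as np

# Hypothetical character set for illustration; index 0 is the CTC blank.
BLANK = 0
CHARSET = ["-", "c", "a", "t"]

def ctc_greedy_decode(scores: np.ndarray) -> str:
    """Greedy CTC decoding: argmax label per time step,
    collapse consecutive repeats, then drop blanks."""
    best_path = scores.argmax(axis=1)  # (T,) label index per time step
    decoded = []
    prev = None
    for label in best_path:
        # A character is emitted only when it differs from the
        # previous step's label and is not the blank symbol.
        if label != prev and label != BLANK:
            decoded.append(CHARSET[label])
        prev = label
    return "".join(decoded)

# Toy per-time-step scores whose best path is "c c - a a t",
# which collapses to the word "cat".
T, C = 6, len(CHARSET)
scores = np.zeros((T, C))
for t, label in enumerate([1, 1, 0, 2, 2, 3]):
    scores[t, label] = 1.0

print(ctc_greedy_decode(scores))  # -> "cat"
```

Greedy decoding is the simplest option; beam-search CTC decoding (optionally with a language model) generally gives better accuracy at higher cost.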
Rajesh Shreedhar Bhat works as a Data Scientist at Walmart, Bangalore. His work primarily focuses on building reusable machine/deep learning solutions that can be used across various business domains at Walmart. He completed his Bachelor's at PESIT and is currently pursuing an MS in CS from ASU. He has research publications in NLP and vision at top-tier conferences such as CoNLL, ASONAM, etc., and he has filed 6 US patents in the retail space leveraging AI & ML. He is also a Kaggle Expert with 3 silver and 2 bronze medals.