Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, and then modifying and retraining the model. When the testing data is at petabyte scale, seamlessly accessing all the images that have been assigned a specific label is a technical challenge in itself. In this talk we present a solution that automates running the model on the testing data and populating an index of the labels so that they become searchable. In this implementation, images and labels are stored in HBase, the model is encapsulated in a PySpark program, and the images are indexed with Solr and can be accessed from a Hue dashboard.
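As a rough illustration of the flow described above (run the model over stored images, emit one label document per image, then look images up by label), here is a minimal sketch in plain Python. It is not the implementation presented in the talk: the in-memory list stands in for the HBase table, the dictionary-based index stands in for Solr, and `classify_stub`, `build_label_docs`, `build_index`, and the row keys are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def classify_stub(image_bytes: bytes) -> str:
    """Hypothetical stand-in for the real model: labels by the first byte."""
    return "cat" if image_bytes[:1] == b"c" else "dog"


def build_label_docs(
    images: Iterable[Tuple[str, bytes]],
    model: Callable[[bytes], str],
) -> List[Dict[str, str]]:
    """Run the model on every (row_key, image) pair and emit index documents."""
    return [{"id": row_key, "label": model(data)} for row_key, data in images]


def build_index(docs: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Build an inverted index from label to image ids (assumed stand-in for a search index)."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc in docs:
        index[doc["label"]].add(doc["id"])
    return dict(index)


# In-memory stand-in for the image store: (row key, raw image bytes).
images = [("img-001", b"c..."), ("img-002", b"d..."), ("img-003", b"c...")]

docs = build_label_docs(images, classify_stub)
index = build_index(docs)

# Retrieve every image that was assigned the label "cat".
print(sorted(index["cat"]))  # prints ['img-001', 'img-003']
```

In a setup like the one described, the labeling step would presumably run as a distributed job and the documents would be sent to a search service rather than kept in a dictionary; the sketch only shows the shape of the data passed between the stages.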
Marton Balassi is a Solution Architect at Cloudera. He focuses on Big Data application development, especially in the streaming and data science space. He is a PMC member of Apache Flink. Marton is a regular contributor to open source and has spoken at a number of Big Data-related conferences and meetups, recently including Hadoop Summit and Apache Big Data.