Classifying Text in Money Transfers: A Use Case of Apache Spark in Production for Banking


At BBVA (the second biggest bank in Spain), every money transfer a customer makes goes through an engine that infers a category from its textual description. This engine was developed in Spark, mixes MLlib with our own implementations, and is currently in production, serving more than 5M customers daily. In our proposed presentation, we plan to describe the process followed by the Data Science team. This includes the problem (classifying 700K daily transfers by their text), the data science challenges, the algorithmic and engineering solution, and the achievements and learnings.

To make the talk practical and focused, we will walk through our best-performing pipeline, which uses (i) word2vec embeddings to represent words, (ii) a pooling algorithm to aggregate them into sentence embeddings, and (iii) a supervised classifier. We believe this is relevant because it mixes (i) off-the-shelf MLlib functions, such as the word2vec embeddings, (ii) components that build on MLlib and were adapted (e.g. a calibrated multi-class logistic regression which outputs probabilities), and (iii) our own implementation of the pooling algorithm, which is known as the Vector of Locally Aggregated Descriptors (VLAD). To our knowledge, word2vec has not previously been combined with VLAD in NLP, so this would be a novel contribution.

The relevant audience is data scientists, engineers, team leaders and executives interested in Spark and seeking examples of machine learning deployments in a real-world production environment. The main takeaways will be: (i) it is relatively simple to build a text classification system in Spark, and (ii) with extra effort it is feasible to build state-of-the-art, production-ready solutions. We will mention the problems we found in practice, such as how to design a training corpus to maximize precision rather than recall, and how we designed the system against “catastrophic” classification mistakes.
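As an illustration of the pooling step in the pipeline above, here is a minimal sketch of VLAD in plain Python. It is not the BBVA implementation: it follows the standard VLAD formulation, in which each word vector is assigned to its nearest codebook centroid and the residuals are accumulated per centroid, then concatenated and L2-normalized. The function name `vlad`, the toy 2-dimensional vectors, and the two centroids are all hypothetical; in practice the centroids would presumably come from clustering (e.g. k-means) over the word2vec vectors, and the computation would run on Spark rather than on Python lists.

```python
import math


def vlad(word_vectors, centroids):
    """Pool a variable number of word vectors into one fixed-length vector.

    Standard VLAD: sum the residuals (word - centroid) per nearest centroid,
    concatenate the k residual sums, then L2-normalize.
    """
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for vec in word_vectors:
        # Hard-assign the word to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, centroids[i])),
        )
        for j in range(d):
            residuals[nearest][j] += vec[j] - centroids[nearest][j]
    # Concatenate the per-centroid residual sums into one vector of length k * d.
    flat = [x for row in residuals for x in row]
    norm = math.sqrt(sum(x * x for x in flat))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in flat] if norm > 0 else flat


# Toy data (assumed for illustration): three "word embeddings" and a 2-centroid codebook.
centroids = [[0.0, 0.0], [10.0, 10.0]]
words = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
sentence = vlad(words, centroids)
print(len(sentence))  # the output length is k * d, regardless of the number of words
```

The resulting vector has the same length for any transfer description, whatever its word count, which is what allows it to be fed to a supervised classifier such as the logistic regression mentioned above.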

Session hashtag: #EUds7

About Jose A. Rodriguez-Serrano

Jose has been Lead Data Scientist at BBVA Data & Analytics since 2015, in the area of Advisory and Predictive Models. The role is to put machine learning models into production to improve the experience of banking app users through big data infrastructure. Previously, Jose was Area Manager for Machine Learning at Xerox Research. Jose started as an AI researcher (PhD in computer vision) and has 9 years of experience in industrial innovation with machine learning solutions, including banking, document workflow automation, and traffic sensing. Author of several top-tier publications and over 20 patents. Fascinated by solving real-world problems with state-of-the-art AI and making machine learning easy to use by non-experts.