This is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML, and DL pipelines that leverage all of Spark’s built-in optimizations. The library implements core NLP algorithms, including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection.
With the dual goal of delivering state-of-the-art performance as well as accuracy, the primary design challenges that we’ll cover are:
This talk will be of practical use to people using the Spark NLP library to build production-grade apps, as well as to anyone extending Spark ML and looking to make the most of it.
Session hashtag: #DD4SAIS
Alex Thomas is a data scientist at Indeed. Over his career, Alex has used natural language processing (NLP) and machine learning with clinical data, identity data, and (now) employer and jobseeker data. He has worked with Apache Spark since version 0.9 as well as NLP libraries and frameworks including UIMA and OpenNLP.
David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. Previously, he was with Microsoft's Bing Group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon's financial systems. David holds a PhD in computer science and master's degrees in both computer science and business administration.