State of the Art Natural Language Processing at Scale

This is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines that leverage all of Spark’s built-in optimizations. The library implements core NLP algorithms, including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking and sentiment detection.

With the dual goal of delivering state-of-the-art performance as well as accuracy, the primary design challenges that we’ll cover are:

  1. Using efficient caching, serialization & key-value stores to load large models (in particular very large neural networks) across many executors
  2. Ensuring fast execution on both single machine and cluster environments (with benchmarks)
  3. Providing simple, serializable, reproducible, optimized & unified NLP + ML + DL pipelines, since NLP pipelines are almost always part of a bigger machine learning or information retrieval workflow
  4. Providing simple extensibility APIs for deep learning training pipelines, since most real-world NLP problems require domain-specific models.
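
As a rough illustration of the first challenge, the sketch below shows one generic way to avoid reloading a large model for every batch of records: a per-process cache, so each worker pays the load cost once. This is not Spark NLP's actual mechanism; `load_model`, `lemmatize_partition`, the model path and the toy lookup table are all hypothetical names and data chosen for the example.

```python
from functools import lru_cache

# Counts how many times the (hypothetical) model is actually loaded.
LOAD_COUNT = {"n": 0}


@lru_cache(maxsize=None)
def load_model(path: str) -> dict:
    """Stand-in for an expensive model load; runs once per process per path."""
    LOAD_COUNT["n"] += 1
    # A real loader would deserialize weights from `path`. This toy "model"
    # is just a lookup table from surface form to lemma.
    return {"running": "run", "models": "model", "was": "be"}


def lemmatize_partition(tokens, model_path="hypothetical/lemma_model"):
    """Process one partition of tokens, fetching the model from the cache."""
    model = load_model(model_path)
    # Unknown tokens are passed through unchanged.
    return [model.get(tok, tok) for tok in tokens]


# Three "partitions" handled by the same process share a single model load.
partitions = [["running", "models"], ["was", "fast"], ["models"]]
results = [lemmatize_partition(p) for p in partitions]
```

On a cluster, a function shaped like `lemmatize_partition` could be applied once per partition rather than once per record, so the cached model is reused across all records that a given worker process handles.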

This talk will be of practical use to people using the Spark NLP library to build production-grade apps, as well as to anyone extending Spark ML and looking to make the most of it.

Session hashtag: #DD4SAIS



About Alexander Thomas

Alex Thomas is a principal data scientist at Wisecube. He's used natural language processing and machine learning with clinical data, identity data, employer and jobseeker data, and now biochemical data. Alex is also the author of Natural Language Processing with Spark NLP.

About David Talby

David Talby is the chief technology officer at John Snow Labs, helping healthcare & life science companies put AI to good use. David is the creator of Spark NLP, the world's most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams: in startups, at Microsoft’s Bing in the US and Europe, and scaling Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.