Alexander Thomas

Data Scientist, Indeed

Alex Thomas is a data scientist at Indeed. Over his career, Alex has used natural language processing (NLP) and machine learning with clinical data, identity data, and (now) employer and jobseeker data. He has worked with Apache Spark since version 0.9, as well as with NLP libraries and frameworks including UIMA and OpenNLP.

SESSIONS

State of the Art Natural Language Processing at Scale

This is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines that leverage all of Spark's built-in optimizations. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. With the dual goals of delivering state-of-the-art performance and accuracy, the primary design challenges we'll cover are: (1) using efficient caching, serialization, and key-value stores to load large models (in particular, very large neural networks) across many executors; (2) ensuring fast execution in both single-machine and cluster environments (with benchmarks); (3) providing simple, serializable, reproducible, optimized, and unified NLP + ML + DL pipelines, since NLP pipelines are almost always part of a bigger machine learning or information retrieval workflow; and (4) providing simple extensibility APIs for deep learning training pipelines, needed because most real-world NLP problems require domain-specific models. This talk will be of practical use to people using the Spark NLP library to build production-grade apps, as well as to anyone extending Spark ML and looking to make the most of it.
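As a rough companion to the design points above, here is a minimal sketch of what a Spark NLP pipeline that plugs into Spark ML looks like from PySpark. It assumes the spark-nlp Python package and its pretrained models are available; exact annotator and model names vary across releases, so treat this as illustrative rather than canonical.

```python
# A minimal sketch, not the talk's own code: a Spark NLP pipeline built
# from regular Spark ML stages. Annotator and pretrained-model names are
# assumptions that vary across spark-nlp releases.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, LemmatizerModel, PerceptronModel
from pyspark.ml import Pipeline

spark = sparknlp.start()  # SparkSession with the library on the classpath

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
lemmas = LemmatizerModel.pretrained().setInputCols(["token"]).setOutputCol("lemma")
pos = PerceptronModel.pretrained().setInputCols(["document", "token"]).setOutputCol("pos")

pipeline = Pipeline(stages=[document, tokenizer, lemmas, pos])

data = spark.createDataFrame([("Spark NLP extends Spark ML natively.",)], ["text"])
model = pipeline.fit(data)  # annotators fit and transform like any other ML stage
model.transform(data).select("lemma.result", "pos.result").show(truncate=False)
```

Because every annotator is an ordinary Spark ML stage, the fitted pipeline inherits MLlib's serialization, distributed execution, and optimization for free, which is the design point the abstract emphasizes.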

Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Natural Language Processing

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. This talk introduces the NLP library for Apache Spark. It natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP & ML pipelines that leverage all of Spark's built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark will be shared. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk.
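To make the "zero-copy, combined NLP & ML pipeline" claim concrete, the hedged sketch below mixes Spark NLP annotators with ordinary MLlib feature and learning stages in a single Pipeline. The Normalizer and Finisher stages and the column names are assumptions based on common spark-nlp releases, not an excerpt from the talk's notebooks.

```python
# Hedged sketch: one unified Pipeline that runs Spark NLP annotation
# stages followed by standard MLlib feature extraction and learning.
import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

spark = sparknlp.start()

nlp_stages = [
    DocumentAssembler().setInputCol("text").setOutputCol("document"),
    Tokenizer().setInputCols(["document"]).setOutputCol("token"),
    Normalizer().setInputCols(["token"]).setOutputCol("normalized"),
    # Finisher converts annotation structs into plain string arrays
    # so downstream MLlib stages can consume them directly.
    Finisher().setInputCols(["normalized"]).setOutputCols(["tokens"]),
]
ml_stages = [
    CountVectorizer(inputCol="tokens", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
]

# Annotation, featurization, and learning in one serializable pipeline.
pipeline = Pipeline(stages=nlp_stages + ml_stages)
```

Fitting and saving this pipeline works exactly as it would for a pure MLlib pipeline, since the NLP stages never leave the DataFrame execution plan.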

Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and TensorFlow

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. Ideally, all three should integrate into a single workflow, which makes development, experimentation, and deployment much easier. Spark's MLlib provides a number of machine learning algorithms, and there are now also projects making deep learning achievable in MLlib pipelines. All that's missing is the NLP annotation framework. SparkNLP fills this gap by adding NLP annotations to the MLlib ecosystem. This talk will introduce SparkNLP: how to use it, its current functionality, and where it is going in the future. Session hashtag: #EUdd4
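A minimal illustration of the kind of usage the talk covers, assuming the spark-nlp package is installed: running a pretrained pipeline from PySpark. The pipeline name "explain_document_dl" is an assumption; the set of available pretrained pipelines depends on the library version.

```python
# Illustrative only: annotating text with a pretrained Spark NLP pipeline.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()

# Downloads a prebuilt pipeline (name assumed; varies by release).
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("SparkNLP adds NLP annotations to the MLlib ecosystem.")
print(result["entities"])  # named entities found by the pipeline's NER stage
```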