Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.
This talk introduces the NLP library for Apache Spark. It natively extends the Spark ML pipeline API’s which enabling zero-copy, distributed, combined NLP & ML pipelines, which leverage all of Spark’s built-in optimizations. Benchmarks and design best practices for building NLP, ML and DL pipelines on Spark will be shared. The library implements core NLP algorithms including lemmatization, part of speech tagging, dependency parsing, named entityrecognition, spell checking and sentiment detection. The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk.
Session hashtag: #DS1SAIS
Alex Thomas is a data scientist at Indeed. Over his career, Alex has used natural language processing (NLP) and machine learning with clinical data, identity data, and (now) employer and jobseeker data. He has worked with Apache Spark since version 0.9 as well as NLP libraries and frameworks including UIMA and OpenNLP.
David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and running web-scale data science and business platforms and teams - in startups, for Microsoft's Bing Shopping in the US and Europe, and to scale Amazon's financial systems in Seattle and the UK. David holds a PhD in computer science and master's degrees in both computer science and business administration.