David Talby

Chief Technology Officer, Pacific AI

David Talby is the chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and running web-scale data science and business platforms and teams – in startups, for Microsoft’s Bing Shopping in the US and Europe, and in scaling Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

PAST SESSIONS

Advanced Natural Language Processing with Apache Spark NLP (Summit 2020)

NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning, and it is the most widely used NLP library in the enterprise today.

You'll edit and extend a set of executable Python notebooks by implementing these common NLP tasks: named entity recognition, sentiment analysis, spell checking and correction, document classification, and multilingual and multi-domain support. The discussion of each NLP task includes the latest advances in deep learning used to tackle it, including the prebuilt BERT embeddings within Spark NLP, tuned embeddings, and 'post-BERT' research results like XLNet, ALBERT, and RoBERTa.

Spark NLP builds on the Apache Spark and TensorFlow ecosystems, and as such it's the only open-source NLP library that can natively scale to use any Spark cluster, as well as take advantage of the latest processors from Intel and Nvidia. You'll run the notebooks locally on your laptop, but we'll explain and show a complete case study and benchmarks on how to scale an NLP pipeline for both training and inference.
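To make the notebook flow concrete, here is a minimal sketch of one of the tasks above: named entity recognition with pretrained BERT embeddings in Spark NLP. It assumes spark-nlp and pyspark are installed; the pretrained model names and the sample sentence are illustrative and may differ across Spark NLP versions.

    # Minimal NER sketch with pretrained BERT embeddings in Spark NLP.
    # Assumes spark-nlp and pyspark are installed; model names are illustrative.
    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel, NerConverter
    from pyspark.ml import Pipeline

    spark = sparknlp.start()  # local SparkSession with Spark NLP on the classpath

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") \
        .setInputCols(["document", "token"]).setOutputCol("embeddings")
    ner = NerDLModel.pretrained("ner_dl_bert", "en") \
        .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
    entities = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("entities")

    pipeline = Pipeline(stages=[document, tokenizer, embeddings, ner, entities])
    data = spark.createDataFrame([["Apache Spark NLP was presented in San Francisco."]]).toDF("text")
    pipeline.fit(data).transform(data).select("entities.result").show(truncate=False)

Swapping the embeddings stage, for example to tuned or domain-specific embeddings, leaves the rest of the pipeline unchanged.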

Automated and Explainable Deep Learning for Clinical Language Understanding at Roche (Summit 2020)

Unstructured free-text medical notes are the only source for many critical facts in healthcare. As a result, accurate natural language processing is a critical component of many healthcare AI applications, such as clinical decision support, clinical pathway recommendation, cohort selection, and patient risk or abnormality detection. Recent advances in deep learning for NLP have enabled a new level of accuracy and scalability for clinical language understanding, making a broad set of applications possible for the first time.

The first part of this talk will cover the deep learning techniques, explainability features, and NLP pipeline architecture that were applied. We'll provide a short overview of the key underlying technologies: Spark NLP for Healthcare, BERT embeddings, and healthcare-specific embeddings. Then, we'll describe how these were applied to tackle the challenges of a healthcare setting: understanding clinical terminology, extracting specialty-specific facts of interest, and using transfer learning to minimize the required amount of task-specific annotation. The use of MLflow, and its integration with Spark NLP to track experiments and reproduce results, will also be covered.
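As a rough illustration of the MLflow integration mentioned above, experiment tracking around a training run can look like the sketch below; the experiment name, parameters, and metric values are placeholders, not the ones used at Roche.

    # Hedged sketch of MLflow tracking around an NLP training run; all names
    # and values below are illustrative placeholders.
    import mlflow

    mlflow.set_experiment("clinical-nlp")
    with mlflow.start_run(run_name="ner-bert-clinical"):
        mlflow.log_param("embeddings", "bert_base_cased")    # which embeddings were used
        mlflow.log_param("max_epochs", 10)                    # training configuration
        # ... fit the Spark NLP pipeline here ...
        mlflow.log_metric("ner_micro_f1", 0.91)               # placeholder evaluation score
        # The fitted Spark ML PipelineModel can also be logged for reproducibility:
        # mlflow.spark.log_model(fitted_pipeline, "model")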

The second part of the talk will cover automated deep learning: the system's ability to train, tune, and measure models whenever clinical annotators add or correct labeled data. We will cover the annotation process and guidelines; why automation was required to handle the variety in clinical language across providers, document types, and geographies; and how this works in practice. Providing explainable results, including highlighting the evidence in the text for each extracted semantic fact, is another critical business requirement, and we'll show how we've addressed it. This talk is intended for data scientists, software engineers, architects, and leaders who must design real-world clinical AI applications and are interested in lessons learned from applying the latest advances in NLP and deep learning in this space.
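A library-agnostic sketch of the evidence-highlighting idea: NER annotators typically return character offsets for each extracted fact, and those offsets can be used to mark the supporting span in the original note. The offsets and labels below are made up for illustration.

    # Hedged sketch of evidence highlighting: wrap the character spans that
    # support each extracted fact in visible markers. Assumes inclusive end
    # offsets, as Spark NLP annotations use; the example spans are made up.
    def highlight(text, spans):
        """spans: list of (begin, end, label) tuples with inclusive end offsets."""
        out, cursor = [], 0
        for begin, end, label in sorted(spans):
            out.append(text[cursor:begin])
            out.append(f"[{text[begin:end + 1]}]({label})")
            cursor = end + 1
        out.append(text[cursor:])
        return "".join(out)

    note = "Patient denies chest pain but reports shortness of breath."
    print(highlight(note, [(15, 24, "SYMPTOM"), (38, 56, "SYMPTOM")]))
    # Patient denies [chest pain](SYMPTOM) but reports [shortness of breath](SYMPTOM).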

Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable, and Unified Natural Language Processing (Summit 2019)

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. This talk introduces the NLP library for Apache Spark.

It natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP and ML pipelines that leverage all of Spark's built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark will be shared. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection.

The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk.
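As a sketch of what a combined NLP and ML pipeline means in practice, the following toy example mixes Spark NLP annotators with a regular Spark ML feature stage in a single Pipeline, so the whole flow is planned, optimized, and persisted together. Column names and the sample data are illustrative.

    # Hedged sketch of a combined NLP + ML pipeline: Spark NLP annotators and a
    # standard Spark ML stage share one Pipeline. Column names and toy data are
    # illustrative.
    import sparknlp
    from sparknlp.base import DocumentAssembler, Finisher
    from sparknlp.annotator import Tokenizer, Normalizer
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import CountVectorizer

    spark = sparknlp.start()

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
    finisher = Finisher().setInputCols(["normalized"]).setOutputCols(["tokens"])  # annotations -> plain string arrays
    vectorizer = CountVectorizer(inputCol="tokens", outputCol="features")         # regular Spark ML stage

    pipeline = Pipeline(stages=[document, tokenizer, normalizer, finisher, vectorizer])
    data = spark.createDataFrame([["Spark NLP extends the Spark ML pipeline API."]]).toDF("text")
    pipeline.fit(data).transform(data).select("tokens", "features").show(truncate=False)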

State of the Art Natural Language Processing at Scale – continues (Summit 2018)

This is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines, leveraging all of Spark's built-in optimizations. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. With the dual goal of delivering state-of-the-art performance as well as accuracy, the primary design challenges we'll cover are:

  1. Using efficient caching, serialization & key-value stores to load large models (in particular very large neural networks) across many executors
  2. Ensuring fast execution on both single machine and cluster environments (with benchmarks)
  3. Providing simple, serializable, reproducible, optimized & unified NLP + ML + DL pipelines, since NLP pipelines are almost always part of a bigger machine learning or information retrieval workflow (see the persistence sketch after this session description)
  4. Simple extensibility APIs for deep learning training pipelines, required since most real-world NLP problems call for domain-specific models.
This talk will be of practical use to people using the Spark NLP library to build production-grade apps, as well as to anyone extending Spark ML and looking to make the most of it. Session hashtag: #DD4SAIS
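For design point 3, here is a minimal persistence sketch: because Spark NLP annotators implement Spark ML's persistence interfaces, a fitted NLP pipeline can be saved and reloaded as a single PipelineModel. The save path and toy data are illustrative.

    # Hedged sketch of saving and restoring a unified NLP pipeline as one
    # Spark ML PipelineModel; the save path is illustrative.
    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer
    from pyspark.ml import Pipeline, PipelineModel

    spark = sparknlp.start()
    data = spark.createDataFrame([["Serialize the whole pipeline, not its parts."]]).toDF("text")

    pipeline = Pipeline(stages=[
        DocumentAssembler().setInputCol("text").setOutputCol("document"),
        Tokenizer().setInputCols(["document"]).setOutputCol("token"),
    ])
    model = pipeline.fit(data)

    model.write().overwrite().save("/tmp/nlp_pipeline_model")   # persist every stage together
    restored = PipelineModel.load("/tmp/nlp_pipeline_model")    # reproduce the exact pipeline later
    restored.transform(data).select("token.result").show(truncate=False)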

State of the Art Natural Language Processing at Scale (Summit 2018)

This is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines, leveraging all of Spark's built-in optimizations. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. With the dual goal of delivering state-of-the-art performance as well as accuracy, the primary design challenges we'll cover are:

  1. Using efficient caching, serialization & key-value stores to load large models (in particular very large neural networks) across many executors
  2. Ensuring fast execution on both single machine and cluster environments (with benchmarks)
  3. Providing simple, serializable, reproducible, optimized & unified NLP + ML + DL pipelines, since NLP pipelines are almost always part of a bigger machine learning or information retrieval workflow
  4. Simple extensibility APIs for deep learning training pipelines, required since most real-world NLP problems call for domain-specific models.
This talk will be of practical use to people using the Spark NLP library to build production-grade apps, as well as to anyone extending Spark ML and looking to make the most of it. Session hashtag: #DD4SAIS

Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Natural Language Processing (Summit 2018)

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

This talk introduces the NLP library for Apache Spark. It natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP and ML pipelines that leverage all of Spark's built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark will be shared. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk. Session hashtag: #DS1SAIS