Alexander Thomas

Principal Data Scientist, Wisecube AI

Alex Thomas is a principal data scientist at Wisecube. He’s used natural language processing and machine learning with clinical data, identity data, employer and jobseeker data, and now biochemical data. Alex is also the author of Natural Language Processing with Spark NLP.

Past sessions

Summit 2021 Drug Repurposing using Deep Learning on Knowledge Graphs

May 26, 2021 04:25 PM PT

Discovering new drugs is a lengthy and expensive process. Finding new uses for existing drugs can therefore yield new treatments in less time and at lower cost. The difficulty is in identifying these potential new uses.

How do we find these undiscovered uses for existing drugs?

We can unify the available structured and unstructured data sets into a knowledge graph by fusing the structured data sets and performing named entity extraction on the unstructured ones. Once the graph is built, we can use deep learning techniques to predict latent relationships.
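
As a rough sketch of the latent-relationship step, the toy example below scores candidate triples with TransE-style embeddings; the entity and relation names and the random vectors are purely illustrative and stand in for embeddings learned from the fused knowledge graph.

```python
# Minimal sketch of link prediction on a knowledge graph with TransE-style scoring.
# All names and vectors are toy stand-ins, not the actual Wisecube pipeline.
import numpy as np

rng = np.random.default_rng(42)
entities = ["aspirin", "ibuprofen", "inflammation", "fever"]
dim = 16

# Random vectors stand in for embeddings learned from the fused knowledge graph.
ent_emb = {e: rng.normal(size=dim) for e in entities}
rel_emb = {"treats": rng.normal(size=dim)}

def score(head, relation, tail):
    # TransE plausibility: head + relation should land near tail,
    # so a smaller distance means a more plausible (latent) triple.
    return -np.linalg.norm(ent_emb[head] + rel_emb[relation] - ent_emb[tail])

# Rank candidate indications for an existing drug to surface repurposing leads.
candidates = ["inflammation", "fever"]
ranked = sorted(candidates, key=lambda t: score("ibuprofen", "treats", t), reverse=True)
print(ranked)
```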

In this talk we will cover:

  • Building the knowledge graph
  • Predicting latent relationships
  • Using the latent relationships to repurpose existing drugs
In this session watch:
Alexander Thomas, Principal Data Scientist, Wisecube AI
Vishnu Vettrivel, Founder and CEO, Wisecube AI

Summit Europe 2020 Using NLP to Explore Entity Relationships in COVID-19 Literature

November 17, 2020 04:00 PM PT

In this talk, we will cover how to extract entities from text using both rule-based and deep learning techniques, and how to use rule-based entity extraction to bootstrap a named entity recognition model. We will also cover another important aspect of this project: how to infer relationships between entities and combine them with the explicit relationships found in the source data sets. Although the talk focuses on the CORD-19 data set, the techniques covered are applicable to a wide variety of domains. This talk is for anyone who wants to learn how to use NLP to explore relationships in text.
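
As a rough illustration of the rule-based extraction step, here is a minimal dictionary matcher in plain Python that produces weak "silver" entity labels; the gazetteer terms and example sentence are illustrative and not drawn from the actual CORD-19 pipeline.

```python
# Hedged sketch: bootstrap NER training data with a dictionary (gazetteer) matcher.
import re

# Toy gazetteer mapping surface forms to entity types.
gazetteer = {"remdesivir": "DRUG", "sars-cov-2": "VIRUS", "ace2": "PROTEIN"}
pattern = re.compile("|".join(re.escape(term) for term in gazetteer), re.IGNORECASE)

def weak_label(sentence):
    """Return (start, end, label) spans from rule-based matches; these can be
    written out as silver training data for a statistical NER model."""
    return [(m.start(), m.end(), gazetteer[m.group(0).lower()])
            for m in pattern.finditer(sentence)]

text = "Remdesivir inhibits SARS-CoV-2 replication and does not bind ACE2."
print(weak_label(text))
```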

What you will learn
- How to extract named entities without a model
- How to bootstrap an NLP model from rule-based techniques (see the training sketch after this list)
- How to identify relationships between entities in text
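
And a hedged sketch of the bootstrapping step, assuming the weak labels above have been exported in CoNLL format; the file path, pretrained embedding name, and hyperparameters are placeholders rather than the exact configuration used in the talk.

```python
# Hedged sketch: train a Spark NLP NerDLApproach model on CoNLL-formatted
# silver annotations produced by the rule-based matcher above.
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Silver annotations exported in CoNLL format (path is a placeholder).
training_data = CoNLL().readDataset(spark, "silver_annotations.conll")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_trainer = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5)

ner_model = Pipeline(stages=[embeddings, ner_trainer]).fit(training_data)
```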

Speakers: Alexander Thomas and Vishnu Vettrivel

Summit 2020 Advanced Natural Language Processing with Apache Spark NLP

June 25, 2020 05:00 PM PT

NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning. It's the most widely used NLP library in the enterprise today.

You'll edit and extend a set of executable Python notebooks by implementing these common NLP tasks: named entity recognition, sentiment analysis, spell checking and correction, document classification, and multilingual and multi-domain support. The discussion of each NLP task includes the latest advances in deep learning used to tackle it, including the prebuilt use of BERT embeddings within Spark NLP, using tuned embeddings, and 'post-BERT' research results like XLNet, ALBERT, and RoBERTa.

Spark NLP builds on the Apache Spark and TensorFlow ecosystems, and as such it's the only open-source NLP library that can natively scale to use any Spark cluster, as well as take advantage of the latest processors from Intel and Nvidia. You'll run the notebooks locally on your laptop, but we'll explain and show a complete case study and benchmarks on how to scale an NLP pipeline for both training and inference.
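
For orientation, here is a minimal sketch of the kind of pipeline the notebooks build, using pretrained BERT embeddings for named entity recognition; the pretrained model names are assumptions based on the public Spark NLP model hub and may need to be swapped for whatever models are available in your environment.

```python
# Hedged sketch of a Spark NLP pipeline with pretrained BERT embeddings feeding NER.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
# Pretrained model names are assumptions; substitute any compatible BERT/NER pair.
bert = BertEmbeddings.pretrained("bert_base_cased", "en") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl_bert", "en") \
    .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
chunker = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("entities")

pipeline = Pipeline(stages=[document, tokenizer, bert, ner, chunker])

data = spark.createDataFrame([["Databricks hosted the Spark + AI Summit in San Francisco."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("entities.result").show(truncate=False)
```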

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. This talk introduces the NLP library for Apache Spark.

It natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP and ML pipelines that leverage all of Spark's built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark will be shared. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection.

The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk.
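
A hedged sketch of such a combined pipeline, mixing Spark NLP annotators with Spark MLlib feature and model stages in a single Pipeline; the toy training rows and column names are illustrative only.

```python
# Hedged sketch: one zero-copy Pipeline combining Spark NLP annotators with MLlib stages.
import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

spark = sparknlp.start()

# Spark NLP annotators produce annotation columns...
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
# ...and the Finisher turns them back into plain arrays that MLlib stages accept.
finisher = Finisher().setInputCols(["normalized"]).setOutputCols(["tokens"])
vectorizer = CountVectorizer(inputCol="tokens", outputCol="features")
classifier = LogisticRegression(labelCol="label")

pipeline = Pipeline(stages=[document, tokenizer, normalizer, finisher, vectorizer, classifier])

train = spark.createDataFrame(
    [("the pipeline ran quickly and scaled well", 1.0),
     ("this job failed with an out of memory error", 0.0),
     ("great accuracy on the benchmark", 1.0),
     ("the results were disappointing", 0.0)],
    ["text", "label"],
)
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```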


Summit 2018 State of the Art Natural Language Processing at Scale

June 4, 2018 05:00 PM PT

This is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP, ML, and DL pipelines that leverage all of Spark's built-in optimizations. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection.

With the dual goal of delivering state-of-the-art performance as well as accuracy, the primary design challenges we'll cover are:

  1. Using efficient caching, serialization & key-value stores to load large models (in particular very large neural networks) across many executors
  2. Ensuring fast execution on both single machine and cluster environments (with benchmarks)
  3. Providing simple, serializable, reproducible, optimized & unified NLP + ML + DL pipelines, since NLP pipelines are almost always part of a bigger machine learning or information retrieval workflow
  4. Providing simple extensibility APIs for deep learning training pipelines, since most real-world NLP problems require domain-specific models

This talk will be of practical use to people using the Spark NLP library to build production-grade apps, as well as to anyone extending Spark ML and looking to make the most of it.

Session hashtag: #DD4SAIS

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

This talk introduces the NLP library for Apache Spark. It natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP and ML pipelines that leverage all of Spark's built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark will be shared. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk.

Session hashtag: #DS1SAIS

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. Ideally, all three of these pieces should integrate into a single workflow, which makes development, experimentation, and deployment much easier. Spark's MLlib provides a number of machine learning algorithms, and there are now also projects making deep learning achievable in MLlib pipelines. The missing piece is the NLP annotation framework. SparkNLP adds NLP annotations into the MLlib ecosystem. This talk will introduce SparkNLP: how to use it, its current functionality, and where it is going in the future.

Session hashtag: #EUdd4