Databricks' commitment to education is at the center of the work we do. Through Instructor-Led Training, Certification, and Self-Paced Training, Databricks Academy provides strong pathways for users to learn Apache Spark™ and Databricks to push their knowledge to the next level.

Our latest offering is a series of short videos introducing the Natural Language Processing technique, Latent Semantic Analysis (LSA). This series explains the conceptual framework of the technique and how the Databricks Runtime for Machine Learning can be used to apply the technique to a body of text documents using Scikit-Learn and Apache Spark.

If you’d like to follow along with the videos on your own computer, simply download the Databricks notebook. If you don’t have a Databricks account yet, get started for free on Databricks Community Edition.

If you’d like to dive deeper into Machine Learning using Databricks, check out our self-paced course Introduction to Data Science and Machine Learning / AWS (also available on Azure) at Databricks Academy.

Introduction to Latent Semantic Analysis

This video introduces the core concepts in Natural Language Processing and the Unsupervised Learning technique, Latent Semantic Analysis (LSA). The purposes and benefits of the technique are discussed. In particular, the video highlights how the technique can aid in gaining an understanding of latent, or hidden, aspects of a body of documents—in addition to reducing the dimensionality of the original dataset.

A Trivial Implementation of LSA using Scikit-Learn

This video introduces the steps in a full LSA Pipeline and shows how they can be implemented in Databricks Runtime for Machine Learning using the open-source libraries Scikit-Learn and Pandas.

These steps are:

This video uses a trivial list of strings as the body of documents so that you can compare your own intuition to the results of the LSA. After completing the process, we examine two byproducts of the LSA—the dictionary and the encoding matrix—in order to gain an understanding of how the documents are encoded in topic space.

A Second LSA

Here we work through the same steps from the previous video in a second full LSA Pipeline, once more in Databricks Runtime for Machine Learning using the open-source libraries Scikit-Learn and Pandas.

This video uses a slightly more complicated body of documents: strings of text from two popular children’s books. After completing the process, we examine two byproducts of the LSA (the dictionary and the encoding matrix) in order to gain an understanding of how the documents are encoded in topic space. Finally, we plot the resulting documents in their topic-space encoding using the open-source library Matplotlib.
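
A plot of this kind might be sketched as follows. The strings are placeholders standing in for lines from two books, not text from the books used in the video, and the labels `book_a` and `book_b`, the colors, and the output file name are all assumptions for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder (label, text) pairs standing in for lines from two books.
documents = [
    ("book_a", "the cat sat on the red mat"),
    ("book_a", "the cat wore a red hat"),
    ("book_b", "the hungry bear ate the honey"),
    ("book_b", "the bear found more honey"),
]
labels, texts = zip(*documents)

# Document-term matrix from word counts, then a two-topic encoding.
dtm = CountVectorizer(stop_words="english").fit_transform(texts)
encoded = TruncatedSVD(n_components=2, random_state=0).fit_transform(dtm)

# One scatter series per source so the two books get different colors.
fig, ax = plt.subplots()
colors = {"book_a": "tab:blue", "book_b": "tab:orange"}
for label, color in colors.items():
    rows = [i for i, name in enumerate(labels) if name == label]
    ax.scatter(encoded[rows, 0], encoded[rows, 1], color=color, label=label)

ax.set_xlabel("topic 1")
ax.set_ylabel("topic 2")
ax.legend()
fig.savefig("lsa_topic_space.png")
```

If the encoding separates the two sources, points of the same color should cluster along different directions in the plot.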

Improving the LSA with a TFIDF

This video works through a third full LSA Pipeline using Databricks Runtime for Machine Learning and the open-source libraries Scikit-Learn and Pandas.

Here we iterate on the previous LSA Pipeline by using an alternate method, Term Frequency-Inverse Document Frequency, to prepare the Document-Term Matrix. After completing the process, the video examines two byproducts of the LSA—the dictionary and the encoding matrix—in order to gain an understanding of how the documents are being encoded in topic space. Finally, the video plots the resulting documents in their topic-space encoding using the open source library Matplotlib and compares the plot to the plot prepared in the previous video.
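
To make the difference concrete, the sketch below prepares the Document-Term Matrix twice, once with raw counts and once with TF-IDF, and compares how heavily a very common word is weighted relative to a rarer one. The toy corpus and variable names are assumptions for this example, and Scikit-Learn's default TF-IDF settings are used, which may differ from those in the video.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; "the" appears in every document.
documents = [
    "the cat chased the dog",
    "the dog chased the cat",
    "the car needs new tires",
    "new tires for the car",
]

# Document-term matrix from raw counts.
count_vectorizer = CountVectorizer()
count_dtm = count_vectorizer.fit_transform(documents).toarray()

# Document-term matrix from TF-IDF weights (same vocabulary and column order,
# because both vectorizers use the same default tokenization).
tfidf_vectorizer = TfidfVectorizer()
tfidf_sparse = tfidf_vectorizer.fit_transform(documents)
tfidf_dtm = tfidf_sparse.toarray()

# Weight of "the" relative to "cat" in the first document, under each scheme.
the_col = count_vectorizer.vocabulary_["the"]
cat_col = count_vectorizer.vocabulary_["cat"]
count_ratio = count_dtm[0, the_col] / count_dtm[0, cat_col]
tfidf_ratio = tfidf_dtm[0, the_col] / tfidf_dtm[0, cat_col]
print("the/cat with counts:", count_ratio)
print("the/cat with TF-IDF:", tfidf_ratio)

# The rest of the pipeline is unchanged: truncated SVD on the new matrix.
lsa_tfidf = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf_sparse)
print(lsa_tfidf)
```

Because "the" occurs in every document, its inverse document frequency is the lowest in the vocabulary, so TF-IDF shrinks its influence relative to words that distinguish one document from another.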

Latent Semantic Analysis with Apache Spark

In this video, we begin looking at a new, larger dataset: the 20 newsgroups dataset. In order to work with this larger dataset, we move the analysis pipeline to Apache Spark using the Scala programming language. This video introduces a new type of NLP-specific preprocessing: lemmatization. We also discuss key differences between performing NLP in Scikit-Learn and Apache Spark.

We hope that you find these videos informative, as well as entertaining! The full video playlist is here. If you’d like to dive deeper into Machine Learning using Databricks, check out our self-paced course Introduction to Data Science and Machine Learning / AWS (also available on Azure) at Databricks Academy.