Advanced Natural Language Processing with Apache Spark NLP

May 26, 2021 03:15 PM (PT)


This hands-on deep-dive session uses the open-source Apache Spark NLP library to explore advanced NLP in Python. Apache Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning. Apache Spark NLP is the only open-source NLP library that can natively scale to use any Apache Spark cluster, as well as take advantage of the latest processors from Intel and Nvidia. It’s the most widely used NLP library in the enterprise today.

You’ll edit and run executable Python notebooks as we walk through these common NLP tasks: document classification, named entity recognition, sentiment analysis, spell checking and correction, grammar understanding, question answering, and translation. The discussion of each NLP task includes the latest advances in deep learning and transfer learning used to tackle it – from the hundreds of BERT-based embeddings to models based on the T5 transformer, MarianNMT, multilingual and domain-specific models.

In this session watch:
David Talby, Chief Technology Officer, John Snow Labs

 

Transcript

Speaker 1: Hello everyone, and welcome. My goal with this session is to help you with your text mining and natural language processing projects, using the free, open source Spark NLP library. So let's get started. For those of you who don't know it, I'll introduce the library and also give some updates about what changed in the last year. Then we'll talk a bit about new improvements and features in accuracy, speed and scalability. Then I'll show you some code, so you can see just how easy it is to either use existing models or train your own, and we'll walk through some examples.
So Spark NLP, for those of you who don't know it, is the most widely used natural language processing library in the enterprise, for three years in a row, based on several industry surveys, the largest one being run every year by O'Reilly Media. And basically what we see is that there are great libraries that are used a lot in research and academia, but when people move to commercialize and productize text mining efforts, they often move to Spark NLP, or scale with Spark NLP at that point. As a company, John Snow Labs is focused on the healthcare and life science space. And specifically within healthcare, based on the latest industry survey, Spark NLP is used by 54% of teams. In the same survey, Spark NLP is also used by 36% of all teams that use any NLP library.
And the last thing to note, and really this is a thank you to you, to the community overall: we've seen a 16x growth in the number of downloads of Spark NLP just from January 2020 to April 2021, which is 16x in exactly 16 months, and we just announced the 5 million downloads milestone last week.
So Spark NLP is an open source software library for state-of-the-art natural language processing, with the goal of taking the latest improvements in research, mostly around deep learning and transfer learning nowadays, and providing you with production-grade, scalable, and trainable versions of those new innovations. The library has a full API for Python, which is the most widely used, as well as for Scala and Java, which are heavily used as well. It also comes with an ecosystem of pre-trained models and pipelines, and we'll show how to work with these online. Right now there are over 1,400 of them, with more being added every week.
The library is well known for being very actively developed. We've been releasing software every two weeks since October 2017, and we intend to continue doing so going forward. And once again, I would like to thank the community and the industry for helping out, providing feedback, code, documentation, bug fixes and examples. This has enabled us not just to keep releasing, but to keep growing functionality in a very significant way, and to keep up with a very fast-moving field in terms of actually giving you state-of-the-art accuracy, speed and scalability, which keep improving every year. One thing that happens if you claim that you're doing state of the art is that you are almost always re-implementing things you have done before, because by the time you implement something, a new paper comes out with a better way to do it, which we then need to support.
As I mentioned, basically every industry report shows Spark NLP as the most widely used library. There are definitely other excellent ones, such as spaCy, Hugging Face, NLTK, and Stanford CoreNLP, which are all used in different contexts. Some of them have been around the longest; some of them are really better suited for research and academia. Spark NLP is really best placed when you need a production-grade NLP pipeline and NLP library, whether you're working on your own machine or need to scale to large systems. And as you can see here, from the largest companies in the world to companies in retail, energy, healthcare and insurance, organizations have been using the library for quite a while now, three and a half years in, and have been helping build it and grow the community around it.
Our focus is usually not on producing novel research results. Although there are several new academic papers we've written, usually our focus is on taking new academic results and making them usable. Because, as I'm sure you know if you've tried it, there's a very big difference between an academic paper that says, "Yes, we've been able to improve accuracy, we've done something new," and something that you can actually use day to day and depend upon to do your job, and have systems that work, serve people and do good. You can see some of the aspects we focus on. First of all, trainable and tunable systems: all the models and all the capabilities I'm going to show you today with Spark NLP are trainable models.
So if you need your own named entity recognizer, your own document classifier, your own emotion detector, your own spell checker, your own language detector, all of those are trainable and tunable. Another important thing is privacy. Spark NLP does not do anything as a service. It is designed to run on your infrastructure, within your environment. It does not send anything outside, it does not call home, and it does not share anything with any third party, not us, not anyone else. And it's heavily used in high-compliance industries: healthcare, finance, life science, insurance, government. Explainability and reproducibility are obviously important as well.
The results and accuracy metrics I'll show you here are reproducible, in the sense that you can take the notebooks, run them yourself, and see the same results. And that means a lot of work around logging and experiment tracking, giving you better visibility into what your experiments are doing. We've done work on hardware optimization, and we'll talk more about that, making sure that out of the box you make the most of the hardware you have. Of course, scalability is based on Apache Spark, and until today it's really the only open source NLP library that's natively scalable to any cluster of CPUs or GPUs. And the last thing is really building a community, of which Data + AI Summit is a big part. There's also the NLP Summit, which we organize as well, really to help educate, share lessons learned and case studies, and help grow the community.
In terms of what it does, Spark NLP really is an end-to-end library, meaning that you really shouldn't, and usually don't, need to use any other NLP libraries in conjunction. It starts with the basic and simple functionality, the kind of things that NLTK and spaCy do. So you get tokenization, lemmatization and stemming, part-of-speech tagging, dependency parsing, normalization, date extraction, and sentiment analysis. On top of that you get the more sophisticated deep learning models: state-of-the-art named entity recognition, information extraction, document classification, emotion analysis, and we'll see some examples of that down the road. The library comes with its own transformers, and there's a fairly large ecosystem of them. What that means is that we take care of optimizing how we load transformers, how we cache them in memory, and, if we need to, how we distribute them across the cluster. We actually have some custom implementations to profile and optimize speed and memory use, and to make sure that you can easily share transformers across different parts of the pipeline.
So if you have, for example, multiple entities you need to extract, and then multiple document classifiers, you can cache and share the same embeddings, both for training and for inference. Another advantage of having a single pipeline that does everything for you, from basic sentence segmentation to question answering, translation and summarization, is that we can then natively scale the entire pipeline. You'll get to the point where you want to scale to a cluster, but if you're mixing in some things with spaCy, some with Stanford CoreNLP, some with Hugging Face, not everything benefits from scaling.
You need to reload things or move things between memory spaces multiple times. So there's the benefit in terms of speed, but also, just as importantly, in terms of the elegance of the code, the developer experience. It's more elegant, easier to understand, easier to debug. And what we're seeing, once you get past homework-style examples to real systems, is that the feedback we often get is that Spark NLP takes some time to learn the API, but at the end, the resulting system is much simpler than what people thought they'd have, or what they had before.
A month ago we introduced a new major version, Spark NLP 3. Spark NLP 3, first of all, runs on Spark 3, on Spark 3.0 and 3.1. We also maintain support for Spark 2.3 and 2.4, because there is a lot of enterprise use of the software in production. We are committed to that, and committed to making sure that wherever people use Spark NLP in production, it's going to run and run well. If you look at the compute platforms Spark NLP 3 supports: we now support all current Databricks versions, 6.x, 7.x and 8.x, whether with CPU or GPU, and we've tested and optimized on all of them.
If you're working on a single machine, and this is very common (sometimes we get the question of whether Spark NLP only applies if you're working on a cluster, but probably most people just work on their local machines), we have builds for Linux, for Mac and for Windows. You can work either just on your local machine, or within Docker, with or without Kubernetes. We have support for the latest AWS EMR versions, and we've tested on AWS, Azure and GCP, both the 2.7 and 3.x versions, in the cloud or on your own infrastructure.
And so, as you can imagine, we spent a lot of the last quarter testing on these different environments and making sure not only that it works, but also that it's highly optimized. What you can see on the right are recent performance benchmarks. These are benchmarks we actually ran on Databricks, on Databricks 7.2, on AWS with 10 machines, just comparing Spark NLP 2.7 with Spark NLP 3.0. As you can see, calculating BERT embeddings is between 6.5 and 7.9 times faster, a very dramatic improvement whenever you're using BERT. And for the most common use case, named entity recognition, we see a 3x speedup on GPU, just between the two versions of Spark NLP. It just goes to show the significant amount of profiling and optimization, and also the work with Intel and NVIDIA, that we've done in the last few months.
The next thing we're going to talk about is accuracy, and what state-of-the-art accuracy means in our context. But before that, I'd like to show you some examples of the library in action. To reach this website, all you need to do is Google "Spark NLP", and the first site that comes up is this page, nlp.johnsnowlabs.com, which is the homepage for the open source library. Here you can of course get started and get download instructions for Python, Java, Databricks, Scala, AWS, whatever your environment is. You can look at some demos, or you can of course go to the code itself, or look into contributing. Interesting items you have here: there's the entire reference documentation for the library; there's a learning hub, which has a whole bunch of videos and articles about the library; there is the models hub, and as we mentioned, there's a big collection of models, as of today exactly 1,444 models, all searchable by task, by human language and by Spark NLP edition; and there are the live demos.
And the demos are interesting because they're not just demos: you also have the code examples to go with them. As you can see, for each demo here you have a live demo. So, for example, if you go to identifying fake news, you have a Streamlit app. We chose a headline claiming that someone is a KGB spy. Maybe, maybe not. We have the actual text, the model ran on it, and in this case the model predicted that this is a fake news item, with a confidence of 100%.
If we look at a different example, a news item about Rubio, you have the text here, and in this case the model predicts that this is real, that it is not fake, also with a confidence of 100%. So this is just one example of document classification. Document classification is the classic example: we take a piece of text and need to classify it. In this case we only classify whether it's real or fake, and you can do the same thing, for example, with spam detection. What's important to know is that for each of those live demos, you can also click the "Open in Colab" link here on the left, and this will give you the actual code behind these demos.
So what you have here is over 100 example Colab notebooks. You can click "Open in Colab" and just run it yourself. If you want to change the text, run it on your own text: just open it in Colab, and this is now your own private environment. You can change your parameters, change the model, change the demo text, do whatever it is you need to do to see if this works for your own case. And then you can see the code. In terms of installation, installation is a one-liner.
You just fetch the colab.sh setup script from the John Snow Labs site and run it as a shell script, and it does whatever it needs to do to install Spark NLP in that cloud environment. Then you have a set of imports, and then another one-liner to start a Spark NLP session. And this is important, because one useful thing to know about Spark NLP, and you'll see it in the code here, is that you really don't need to know the Spark API, and you're not forced to use it. If you'd just like to do something locally on your machine and you're thinking, "Can I just use this on a local machine without all the cluster setup?", you really can. And if you want to, Spark NLP exposes the Spark session, so you can configure it: multiple nodes, a large cluster, Databricks, whatever you need to take advantage of.
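For reference, here is a minimal sketch of that setup in a notebook, using plain pip rather than the demo's colab.sh script, since the exact script URL isn't shown in the transcript:

```python
# Minimal install (the Colab demos use a one-line colab.sh setup script instead):
#   pip install pyspark spark-nlp

import sparknlp

# One-liner to start a Spark session preconfigured for Spark NLP;
# pass gpu=True to use the GPU build if you're running on a GPU machine.
spark = sparknlp.start()
print("Spark NLP version:", sparknlp.version())
```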
In this case, you can see some of the example headlines: "Donald Trump is a KGB spy" and one about Barack Obama. And then the main code defines the NLP pipeline. Spark NLP is based on Spark ML, so an NLP pipeline is a Spark ML pipeline: we use the same Pipeline class from Spark ML, which makes it super easy to integrate Spark NLP with the rest of the Spark ecosystem. So if you need to load 100 terabytes of text data from S3, you load it into a Spark data frame, and if after the NLP steps you have some machine learning tasks you want to do with Spark ML, not only can you do them, they're going to run distributed on the cluster.
So it integrates very nicely with the ecosystem, and really, the more complex your use case, the more you'll see the benefits in terms of optimization. What we're doing here for classification is very simple. We take the text and turn it into a document; that's a Spark NLP document, the first stage. We calculate sentence embeddings, so we load those sentence embeddings and ask the model to calculate them. And then we apply the document classifier. In this case we load the pre-trained model, and here the model name comes from the models hub. Then we define the pipeline with the document assembler, the sentence embeddings, and the classifier.
And that's the entire pipeline. Once we build it, you can do two things with an NLP pipeline. You can call fit, which means training: every pipeline is trainable, although in this case we don't need to train since everything is pre-trained, so we just show it as an example. The other method is transform, which means: please do inference. Once you do inference, you see the results, and you can see that "Donald Trump is a KGB spy" is fake, while "Barack Obama blasted former Secretary of State Hillary Clinton's use of..." was real, and so on. You can see how easy it is to load the classifier and run it on your own dataset. So that's one example of document classification.
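Here is a minimal sketch of that pipeline; "classifierdl_use_fakenews" is the pre-trained model name listed on the models hub for this demo, and the headline is the one from the talk:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLModel

spark = sparknlp.start()

# Stage 1: turn raw text into a Spark NLP document.
document = DocumentAssembler() \
    .setInputCol("text").setOutputCol("document")

# Stage 2: pre-trained Universal Sentence Encoder embeddings.
use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")

# Stage 3: pre-trained fake-news document classifier from the models hub.
classifier = ClassifierDLModel.pretrained("classifierdl_use_fakenews") \
    .setInputCols(["sentence_embeddings"]).setOutputCol("class")

pipeline = Pipeline(stages=[document, use, classifier])

data = spark.createDataFrame([["Donald Trump is a KGB spy."]], ["text"])
model = pipeline.fit(data)  # nothing to train here; all stages are pre-trained
model.transform(data).select("class.result").show(truncate=False)
```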
Let's look at a different example. Another really common task is entity recognition. Entity recognition is the task of looking at a piece of text and identifying entities. For example: "William Henry Gates III, who was born on this date, is an American..." and so on. A named entity recognizer in this case tries specifically to identify things like people, organizations, dates and a few other types. So you want to find that "William Henry Gates III", all four tokens together, is one entity, and the entity is a person; "October 28, 1955" is a date; and so on for the other entities. Here again you have the examples, and just as you can click Live Demo, you can also click Colab Notebook and go to the notebook. As you can see, in this notebook you have the one-line installer; after the installer you run the imports; after the imports you call start. You have the example text here, and you define the pipeline. The pipeline creates a document, then tokenizes it, splitting the text into words, and then loads models based on the model name.
So if the model name is one trained on particular embeddings, it loads that model; if the model name is based on different embeddings, it loads the model built on those embeddings. And then the converter basically collects the tokens, so you can see whole entities: all the tokens that make up one entity. You define the pipeline, you run it, you call transform. One other interesting thing to show here: there's another library that we've open sourced, sparknlp_display, which enables you to see, within your own notebook, the same visualizations you've seen in the demos.
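A minimal sketch of that NER pipeline plus the notebook visualization; "glove_100d" and "ner_dl" are standard pre-trained names from the models hub (the demo notebook may use different ones), and the NER model must be paired with the embeddings it was trained on:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from sparknlp_display import NerVisualizer

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Pre-trained GloVe embeddings; ner_dl was trained on these.
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl") \
    .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")

# Groups token-level IOB tags into whole entities,
# e.g. "William Henry Gates III" -> one PER chunk.
converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, tokenizer, embeddings, ner, converter])
empty = spark.createDataFrame([[""]], ["text"])
light = LightPipeline(pipeline.fit(empty))

result = light.fullAnnotate(
    "William Henry Gates III (born October 28, 1955) is an American.")
NerVisualizer().display(result[0], label_col="ner_chunk", document_col="document")
```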
So once you import the NerVisualizer, you call display, and then you can basically look at the results and see them here, within a notebook, just as you've seen them before in the demo. And similarly, if you go back to the demos, another interesting task you can look at is emotion detection. We can look at this tweet, for example; this is a "surprise" tweet. And we can look at another example; this is a "sad" tweet. This is another "surprise" tweet, about the moment you see your friend in a commercial. And this is a "joy" tweet.
Okay, and this is also an example of a document classifier, so you have those examples as well. The other thing I want to show before we go and talk about accuracy is spell checking. Another very useful thing that's available out of the box is spell checking and spell correction. You can see an example here, zooming in a bit: "Apollo 11 was the space flight that landed the first humans on the moon." As you can see, words like "flight" and "first" are misspelled in the input and are automatically detected and corrected. And same thing: you can open this in Colab, install, run the imports, define the pipeline, and give it an example. So: "Please allow me to introduce myself, I am a man of wealth and taste," with about 10 spelling mistakes in it. And you can see a visualization of the original text next to the corrected revision.
"Please allow me to introduce myself, I am a man of wealth and taste." The only mistake it makes is here, not correcting this word to "and". Everything else, eight or nine mistakes, was corrected. One very important thing to know about this model, and the reason it established a new state of the art, is that it can use context. It can correct the same misspelled word to different tokens based on the context of the words around it.
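A minimal sketch of loading that context-aware spell checker; "spellcheck_dl" is the pre-trained English model name on the models hub:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, ContextSpellCheckerModel

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Corrections depend on surrounding words, so the same misspelling
# can be fixed differently in different sentences.
spell = ContextSpellCheckerModel.pretrained("spellcheck_dl") \
    .setInputCols(["token"]).setOutputCol("corrected")

pipeline = Pipeline(stages=[document, tokenizer, spell])
light = LightPipeline(pipeline.fit(spark.createDataFrame([[""]], ["text"])))
print(light.annotate("Please alow me to introdce myslef")["corrected"])
```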
So with those examples, this is basically how you use the library out of the box. And with that, let's talk a bit about accuracy and what it means in this context. The most important thing to know about the term "state of the art" is that it is not a marketing term; it's a real academic term. Claiming state of the art in anything means that you have a peer-reviewed paper, on a public academic benchmark, that others have validated and can reproduce, showing that, for example, you have better accuracy than any other published paper on the same benchmark so far. And this is what we mean when we say that we aim to deliver state-of-the-art accuracy to you. What you see on the slide are examples specific to the medical space; as I mentioned before, this is where the company focuses, on clinical and biomedical NLP. If you go to paperswithcode.com, which tracks over 4,000 different leaderboards, and you go to the medical named entity recognition category, you'll see that Spark NLP models hold eight of the 11 top places across the leaderboards and metrics collected there.
Some of them, as you can see here on the left, are biomedical NLP: being able to extract things like chemicals, anatomical parts, genes and gene products, and so on. And some of them, as you can see on the right, are on the clinical side: being able to extract clinical concepts such as diseases and drugs, and de-identification. De-identification is one of those tasks that really only in the past few years we can finally do as accurately as human experts, which is really changing that industry.
For the open source library, one of the most common use cases, if not the most common use case, is named entity recognition, and the most widely used benchmark is CoNLL-2003. Spark NLP prides itself on not just having a reference implementation, but actually having a lot of custom features and tricks within the implementation that enable it to really deliver the most accurate system for production. I think there was one academic paper that does slightly better on the benchmark.
But that is an academic paper, meaning the accompanying code is not really usable in production. If you want something that runs today in five lines of code, is widely used in production, and that you can depend on, then with the hundreds of models available, Spark NLP is your best option. And we have kept it that way by re-implementing it basically every year for the past few years, with different implementations. In general, our secret is that we really do not have one. It's not that we have a specific algorithm or specific model that's slightly more accurate than others; it's really the fact that we release software every two weeks. Whenever a new paper comes out, we try to reproduce it, and if it is better, we give it a production-grade implementation as soon as possible.
Another important thing is for you to get the best accuracy right out of the box. So, for example, with NER, all the default parameters are really what you'd need to get optimal accuracy, which makes it very easy to reproduce our results. There are actually open source notebooks, in the Spark NLP workshop online, that enable you to do exactly that. Another thing that's obviously important these days, if you want to achieve state-of-the-art accuracy, is transfer learning and the ability to reuse transformers and embeddings, so that you can apply transfer learning by using what the embeddings have already learned, and reduce the amount of custom data or custom documents you need to train your own models. Really, most of the models that have achieved state of the art in the past few years work this way.
There are well over a hundred different types of embeddings and transformers that come with the library, and they enable you to make different trade-offs: for specific domains, like the legal domain or the healthcare domain, or for languages like German or Bengali. There are also trade-offs between large and base models; BERT alone comes in at least 24 different sizes of embeddings. And you have smaller, condensed variants that trade off memory and size, so you can serve both kinds of use cases in an optimal way.
The other task for which Spark NLP claims state of the art, until someone does better, is multi-class and multi-label text classification. We saw one simple example of text classification, which was binary: is this fake news or real news? You can do the same with spam or not spam, positive sentiment or negative sentiment. Multi-class classification refers to the case of more than two classes. We can say, okay, this is a news item; what kind of news is it? Maybe it's sports, maybe it's business, maybe it's weather, maybe it's politics. And there could be tens or hundreds of classes: by default, the trainable multi-class classifier within Spark NLP, the deep learning one, supports up to 100 classes. The other interesting thing is that this classifier is also multi-label, meaning you don't have to choose only one label per document.
You can see an example here: one of the pre-trained multi-label deep learning classifiers is for toxic comment detection. As you can see up at the top right (I'm not going to read those out loud), one sentence can be an identity attack as well as an insult, as well as an obscenity, as well as a threat, as well as being sexually explicit. You can all think of sentences where more than one label would apply, right? And if you want to find toxic content and also know what you're dealing with, you're looking for something that's not only multi-class but also multi-label, and that reports the confidence for each label. So that comes pre-trained with Spark NLP. You can also train your own multi-class, multi-label, multilingual classifiers using whichever kind of word or sentence embeddings best fit your use case, just like the ones that come out of the box.
The deep learning classifier, the sentiment analysis, and the other trainable models can all use this variety of word and sentence embeddings, meaning that you have a lot of choice. There are really small models that need to run on a smartphone or even in the browser and have to really optimize how much memory they take; some use cases require a 10-megabyte sentence embedding model. In other cases you really just want to optimize accuracy at all costs, like clinical use cases, and then it's fine if you need to hold a 5-gigabyte embedding model in memory; that's not a problem. So we support all those cases. Another state-of-the-art model that comes out of the box is the language detector. There are different pre-trained models, the largest one being able to detect 375 different languages.
The smaller ones can do somewhere around 20, 60, or up to 100 languages; it really depends on what your use case is. These models are small, around three to five megabytes, meaning they load very quickly into memory, and they work very, very fast. And depending on the language, the accuracy ranges between 97 and 99%. The only caveat is that the text needs to be longer than 140 characters: if you have a single word, it could be Italian, but it could also be Romanian.
There are a lot of words that appear in multiple languages. If you have at least 140 characters, which is usually where you measure accuracy, the language is still fairly clear from the context. The spell checker, as I mentioned, is able to use the context around a word, as the name context spell checker suggests. On top of that, you can also train your own spell checker, and people have done this for different languages or for specific business domains.
You can also add custom dictionaries. So if you have terms that only exist within your company, such as specific drug names, project names, product names, whatever jargon you use, you can add these as custom dictionaries. In general, one of the things I don't cover here is that each of those models has quite a few features and configurations to adapt it to your use cases, which is really part of having an industrial-grade system. One benchmark we have is what's called the Holbrook benchmark, really the main academic benchmark we found for evaluating automated spell checking systems, comparing against JamSpell,
which is another widely used library for spell checking. We make less than a third of the number of errors it makes on the Holbrook benchmark. Accuracy aside, assuming you have something that is as accurate as it can be, state of the art does not mean that it's always correct. It means that it's as accurate as we know how to make it today. The other super important thing that we care about is speed and scalability. Those things have to work in real-world systems.
And today, one of the things that takes most of the time, and most of the profiling effort, in those systems is using embeddings, especially large embeddings. So beyond shipping pre-trained transformers and optimizing their trade-offs, one thing that we worked a lot on, and released in Spark NLP 2.6, is a custom implementation of how we load transformers and how we use them, based on some fairly recent papers in this space on how to optimize inference and memory use.
As you can see here, we were able to improve memory consumption by 30% and improve inference performance by 70%, just by smart loading and smart inference. That's something you get out of the box with Spark NLP. Another thing you may want to do, depending on your use case, is decide which size of BERT to use: large, base, medium, small, mini or tiny. There are 24 of them out of the box, which basically provide different trade-offs. The most accurate one will be large, but if you use a tiny BERT, depending on your use case, you may only lose one or two percentage points of accuracy and have a model that is, for example, 24 times faster and 28 times smaller in terms of memory.
So it may well be worth your while to have a model that can reply in eight milliseconds instead of 40 milliseconds, and that takes five seconds to load into memory when you start your server instead of two minutes. All of those options are yours, and as you saw, it's really just a configuration change when you write your code. The other important thing that we do for the community and the industry is to work very closely with both Intel and NVIDIA, to make sure that we deliver optimized builds for both. The graph you can see on the right is an internal benchmark on second-generation Intel Xeon Scalable processors, making sure that we have custom builds, with the MKL library and a specially built TensorFlow, that make use of the deep learning instructions on those chips.
Similarly with NVIDIA: they've recently done work to make Apache Spark GPU-aware, with quite a few optimizations for data frames, and we've done some work with them on that, and also on what they call deep learning tasks and key building blocks, to make sure we get the most out of the hardware. Really, the most important thing to know is that we do all of that for you; we make sure that Spark NLP is as optimized as it can be for whatever you run it on. Beyond speed, another big thing for Spark NLP is being able to scale. There are zero code changes required to scale a pipeline onto any Spark cluster. As you saw in the code, when we defined the pipeline, it's a Spark ML pipeline, and on a cluster it's really the same thing.
It's perfectly serializable, and it distributes to a cluster with no code changes. The benefit here comes from Spark itself, from all of the optimizations that exist within Spark around execution planning, caching, minimizing shuffling, minimizing serialization, and optimizing serialization formats. All of those very hard distributed computing problems are things that Spark does really well, and we've done the work with the open source ecosystem and with Databricks to make sure we make the most of them and use them correctly, to give you the benefit. You can see on the right one example of scaling: this is training your own named entity recognizer and seeing how things scale with the cluster. One thing to know, and you can see it in the list of caveats, is that of course distributed computing is not magic.
The speedup you will get when you go to a cluster will heavily depend on the task. For example, if you're doing inference and the model is pre-trained, then you can trivially parallelize inference across the cluster, and you're going to get nearly linear speedup. If you're doing things that don't natively parallelize, like training a new RNN, then you're going to see a more limited improvement, and you'll see some examples of that. This is a decent benchmark looking at the full dataset of Amazon reviews, which is 15 million sentences, also on Databricks, on AWS, with fairly small servers. You can see the speedup we get going from a single node to 10 nodes, for tokenization and also for named entity recognition.
Tokenization is a fairly simple task, not deep learning based, so you actually see super-linear scale-up: when we add ten nodes, we get more than 10 times the number of words per second that we can process. With NER you do see a speedup, but a more limited one: with ten nodes, the number of words per second is less than ten times higher. And this is really something you need to test yourself, because it's going to be very dependent on your specific pipeline and specific use case.
And this is another example on the same dataset, this time looking specifically at calculating BERT embeddings. As you can see, these are fairly small machines in terms of memory, 32 gigabytes, but we are using multiple cores, and of course multiple machines, and letting Spark do all of the optimizations. One thing, though: if you have a thousand machines, you definitely want to understand Apache Spark in detail and how to optimize its configuration. But usually, if you're dealing with 10, 20, 60 machines, you should expect these kinds of speedups just out of the box.
So we spoke about accuracy, we spoke about speed, and we spoke about scalability. What I'd like to do now is take a few minutes to show you some code, so you can evaluate for yourself how easy the library is to use for different use cases. There are usually three kinds of users. Either you just use a pre-trained pipeline, which means: look, I just want one line of code that does the task as well as it can, and I don't care about the details. Or you come and say: look, I want to tune my own pipeline, because I have my own language, or I need to use my own stop word remover, or my own stemmer, or I need to train my own document classifier; so you use some pre-trained pieces and train some others. Or you may want to train everything on your own.
And as you saw, the key component in Spark NLP, the key abstraction, is a pipeline. We take the text and run it through stages: we split it into sentences, we split those into words, we may remove stop words, we may lemmatize. Then we calculate the embeddings, then maybe we run them through a named entity recognizer, an emotion detector, and so on. That's the basic abstraction. The simplest thing is to just use an existing pipeline. So let's look at the code here. In this code, we just want one thing.
We call sparknlp.start(), and there's really just a GPU flag you pass if you want the GPU build and the GPU optimizations, if you are running on a GPU. The next thing we do, and it's really a one-liner, is load the pre-trained pipeline: the explain_document_dl pipeline. What this pipeline does is take the document, split it into sentences, tokenize, do some basic text cleaning, and then calculate word embeddings. I believe this one uses GloVe embeddings out of the box, but there's also a BERT-based version if you prefer BERT. It does all of that in one line. What this line does is go load the pipeline, load all the models the pipeline requires, load any embeddings or other configuration it requires, and bring them into memory.
That happens either locally or, if you need it, distributed across the cores of the cluster; it handles all of it. Then you have the text, and the one line to actually annotate the text and get your results is result = pipeline.annotate(text). And then you can see what your result is. The result is always just a simple Python dictionary. You can see the tokens, you can see the entities, and everything the processing did. In this case, you'll see "William Henry Gates III", born on this date, is an American, and it will actually find those tokens and the type of each entity that was discovered. So this is really all you need to do: you start Spark, you load the pipeline, and you call pipeline.annotate on the text you're looking at.
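A sketch of that whole flow:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # add gpu=True for the GPU build

# Downloads the pipeline plus every model and embedding it needs, and caches them.
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate(
    "William Henry Gates III (born October 28, 1955) is an American business magnate."
)
# Plain Python dict of lists: token, lemma, pos, ner, entities, ...
print(result["entities"])
```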
There could be a case where you come and say: okay, this is nice, but the problem with the pre-trained pipeline is that I cannot really configure it, and if I want to change a step, or swap models, then I need to define my own. So here's how you do that. You import the library and then you configure your pipeline. The first stage is the document assembler: you take the text and create a document. In the next stage you tokenize it: from the document, you create a new column called token. And this is how it works in Spark ML: we add columns to the data frame we work on, and each pipeline step populates one or more additional columns. This is, first of all, a Spark ML design principle.
It also makes it really easy to parallelize inference, because work on different columns or different documents can happen easily on different machines. In this case, what we do next is stop word removal. We add a stop word cleaner, using the pre-trained stop word removal for English: from the tokens we get clean tokens, and we set it to be case-insensitive. This is, by the way, how you can configure specific steps if you need to. Then we calculate word embeddings, and we calculate them on the tokens without stop words, on the clean tokens. We give the embeddings column a name, and we load the pre-trained ALBERT embeddings model for English. Now we have the pipeline, and once we have a pipeline, you already know the only two methods that matter: fit, to train your own models, and transform, to actually run inference. So in this case, for example, this is how we add a stop word cleaner and configure it; and you can likewise use your own tokenizer, your own sentence detector, your own custom steps, whatever the use case needs.
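A minimal sketch of that custom pipeline; "albert_base_uncased" is an assumed pre-trained model name from the models hub:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, StopWordsCleaner, AlbertEmbeddings

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Each stage reads existing columns and writes a new one onto the data frame.
cleaner = StopWordsCleaner.pretrained() \
    .setInputCols(["token"]).setOutputCol("clean_tokens") \
    .setCaseSensitive(False)

embeddings = AlbertEmbeddings.pretrained("albert_base_uncased", "en") \
    .setInputCols(["document", "clean_tokens"]).setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, tokenizer, cleaner, embeddings])
# pipeline.fit(df) to train; the fitted model's .transform(df) runs inference.
```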
The third thing you may want to do is train your own models, which looks exactly the same. The only two things that change are that you need to actually load the training data, and instead of calling .transform, you need to call .fit. So here, if you look at the code on the right, we do the imports, we start Spark NLP, and we load the training data. You can see we have a helper class called POS; in general, we have helper classes to load training data into data frames for all the types of training datasets Spark NLP supports. So it's usually a one-liner if you have a CSV or a classic training data file that you got online or made yourself. And then the pipeline: we create the document.
So: the document assembler; then the sentence detector, which splits the text into sentences; then we tokenize, splitting each sentence into words; and then the part-of-speech approach. We tell it the input columns are the sentence and the tokens, because to tag parts of speech you need to know the word and what's around it within the sentence. Of course, the same word, like "run", could be either a noun or a verb depending on where it appears. So a part-of-speech tagger needs the sentences and the tokens within the sentences. It has the output column, POS, the part of speech. But then, because we are training it, we also need to tell it where the part-of-speech label column is, which is "tags". And you can also set the number of iterations, if you want to configure the training.
So you fit the pipeline, and you get a trained pipeline. Once you have it, one important thing to know is that pipelines are serializable. So if you want to reproduce the training, you can store the whole thing. The other thing that's obvious by design here, but important to note, is that you use the same pipeline for training and for inference. And that's important because when you do inference, it's critical that you use the exact same sentence detector and the exact same tokenizer: if you removed stop words, you have to remove stop words when you do inference as well, and having the code work the same way makes that much easier. You can train, then serialize the whole thing, and when you load it for inference, you know that you're using the exact same configuration, models and versions.
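A sketch of that training pipeline; the training-file path is illustrative, and POS.readDataset expects delimited word|TAG sentences, one per line:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronApproach
from sparknlp.training import POS

spark = sparknlp.start()

# Helper class: reads "word|TAG word|TAG ..." lines into a training data frame.
train_data = POS().readDataset(spark, "pos_corpus.txt", "|", "tags")  # illustrative path

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

pos = PerceptronApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos") \
    .setPosColumn("tags") \
    .setNIterations(5)

pipeline = Pipeline(stages=[document, sentence, tokenizer, pos])
model = pipeline.fit(train_data)  # fit = train; the same pipeline transforms at inference
```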
This example shows you how to train your own named entity recognizer, which is really exactly the same thing. If you look at the code on the right: we import the libraries; we start Spark NLP, with GPU this time, because if you use deep learning and embeddings, a GPU will actually help. We have CoNLL, one of the little helper classes, which reads a training dataset for named entity recognition into memory. We use pre-trained word embeddings, so the pipeline is going to calculate the embeddings on the training data. What we train is the actual NER model, the NerDL approach, and then we configure it: the inputs are the document, the tokens and the embeddings; we say where the labels are, which is the label column; and we set the output column as well.
And then we set things like max epochs, learning rate, batch size, maybe a validation split, and, very importantly, evaluation log extended set to true. We have different options for how verbose you want training to be, so you know how you're doing in terms of accuracy and so on. Once you're done, it's very simple: model = pipeline.fit(training_data), and you get a model. On the model, you just call transform on a dataset, and that's your inference. Same code locally and on a cluster, same on CPU and GPU, same pipeline for training and inference. And obviously you're able to change the kind of embeddings you want to use. These are some examples, for the sample you saw on the previous slide, of the results you get.
At the top you can see the accuracy per label, which is more of the final result, and below it you can also see accuracy per epoch. You can also get more detailed logging, and you can output, if you want, to other tools, to track your experiments and see how training is going. One important thing in terms of speed: you can run this training basically on a laptop. In this example, we ran it on Google Colab, with 16 gigabytes of memory and one GPU, basically something you can do with your own Gmail account right now, and training on the entire CoNLL 2003 training dataset only took 16 minutes. So these things do not generally require large clusters or days of training; this is not like training GPT-3. You should really be able to do it on Colab or on your own machine. If you have 32 gigabytes, that's great; if you have 16, that will do.
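A sketch of that NER training run; "eng.train" stands in for a CoNLL-2003 format file:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from sparknlp.training import CoNLL

spark = sparknlp.start(gpu=True)  # GPU helps for deep learning training

# Helper: reads a CoNLL-2003 file into document/sentence/token/pos/label columns.
train_data = CoNLL().readDataset(spark, "eng.train")  # illustrative path

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")

ner = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.003) \
    .setBatchSize(32) \
    .setValidationSplit(0.1) \
    .setEvaluationLogExtended(True)

model = Pipeline(stages=[embeddings, ner]).fit(train_data)
predictions = model.transform(train_data)  # the same pipeline object serves inference
```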
Okay, just one more example here: you can also train your own document classifier. This is an example of how you train a multi-class classifier. You can probably already understand the code on the right, but basically you have a training dataset with two columns; it's a CSV. One column is the text, which has the news item, and the other is the category: how do we categorize these news items? Is it sports, or technology, or business, or politics, or weather? All you do is start Spark NLP with GPU, that's the one line, read the training dataset, the CSV with two columns, into a data frame, and set up your pipeline. It's a document assembler that takes the text and makes a document; then we calculate sentence embeddings, in this case with the Universal Sentence Encoder, so we load the pre-trained one and use it to calculate the sentence embeddings; and then we train our document classifier. You set up the pipeline and you're good to go.
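A minimal sketch of that classifier training; the column names "text" and "category" and the CSV path are illustrative:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

spark = sparknlp.start(gpu=True)

# Two-column CSV: the news text and its category label.
train_data = spark.read.option("header", True).csv("news_category_train.csv")

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")

classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setLabelColumn("category") \
    .setOutputCol("class") \
    .setMaxEpochs(5)

model = Pipeline(stages=[document, use, classifier]).fit(train_data)
```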
At this point, I'd like to show a few more examples of other things you can do with Spark NLP before we summarize. We are back at the demos page here, and if you go to "Infer meaning and intent", there are some other interesting NLP tasks we can do, which I'll only touch on, because with more than 1,400 models and different tasks, there's a lot I can never fully cover. So one thing I would recommend: really just come and explore, and if there's a text mining task and you're not sure whether we do it, you can just come ask on Slack. One example, also an out-of-the-box model, is text summarization, which we do with T5. It's really very close to the state of the art right now, at least in English, and we also have multilingual models that can do this fairly well. You can see here one example of a text, and basically, without any parameterization, we are summarizing this long text into a short abstract. You can configure, by the way, within the model, how much you want to summarize: basically, whether you want a more or less aggressive summarization. Here is an example text, and we can see the summary right here. And really, if you want it to be 30 words or 50 words or a hundred words, you can do that. And of course, if you want to do it in Python, that's there as well.
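A sketch of that T5 summarizer; "t5_small" is one of the pre-trained T5 models on the hub, and the output-length setting illustrates the configurability mentioned above:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import T5Transformer

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# T5 handles summarization through a task prefix; output length is configurable.
t5 = T5Transformer.pretrained("t5_small") \
    .setTask("summarize:") \
    .setMaxOutputLength(50) \
    .setInputCols(["document"]) \
    .setOutputCol("summary")

pipeline = Pipeline(stages=[document, t5])
```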
Another interesting thing is the ability to infer word meaning from context. NLP used to focus a lot on linguistics; now it really focuses a lot on meaning. What this model can do is look at two sentences, look at a word, and tell you whether the word means the same thing in both sentences. For example: "The expanded window will give us more time to catch the thieves" versus "You have a two-hour window of time to finish your homework."
Looking at the word "window": does it mean the same thing? And here the model says, yes, I think it does, which it does, because both are talking about a window of time. But, for example, let's look at two other sentences: "It started to snow" and "John Snow was an English physician." We're looking at the word "snow", and the model says it means two different things in those two sentences. This is very important if you're building semantic search applications, really any search application where you want to understand relevance: if I'm just doing a keyword search and getting lots of junk, how do I know which results are actually relevant?
Another very useful thing you can do out of the box is semantic similarity between two sentences. For example, look at these two sentences: "Sign up for our mailing list to get free offers and updates about the product" versus "Subscribe to notifications to receive information about discounts and new offers." The model tells us these sentences are 66% similar, semantically very similar. Although, if you look at the text itself, the only word these two sentences share is the word "about". But because the model uses embeddings, it understands that semantically these are very similar. This is very useful, first of all, for things like spam detection, but also, for example, for building a customer service application: you get an email from a customer and you want to find, say, the most similar support article or knowledge base article that answers the question.
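The open source library doesn't expose a single similarity annotator, so one way to reproduce the demo is to take Universal Sentence Encoder embeddings and compute cosine similarity yourself; a sketch under that assumption:

```python
import numpy as np
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import UniversalSentenceEncoder

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")

light = LightPipeline(
    Pipeline(stages=[document, use]).fit(spark.createDataFrame([[""]], ["text"]))
)

def embed(sentence):
    # fullAnnotate keeps the embedding vectors; annotate() returns only strings.
    return np.array(light.fullAnnotate(sentence)[0]["sentence_embeddings"][0].embeddings)

a = embed("Sign up for our mailing list to get free offers and updates about the product.")
b = embed("Subscribe to notifications to receive information about discounts and new offers.")
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```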
These are the kinds of things you'd use today to implement those kinds of use cases: finding the most similar product, or "I only remember these words from the song, can you find me the song?", and so on. There are quite a few other use cases here: automatic question answering, for both open and closed questions, and translation. The library does language detection and also translation in more than 100 languages.
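A sketch of a translation pipeline; "opus_mt_it_en" (Italian to English) is one of the many MarianNMT translation models on the hub:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetectorDLModel.pretrained() \
    .setInputCols(["document"]).setOutputCol("sentence")

# MarianNMT model translating Italian to English; hundreds of language pairs exist.
marian = MarianTransformer.pretrained("opus_mt_it_en", "xx") \
    .setInputCols(["sentence"]).setOutputCol("translation")

pipeline = Pipeline(stages=[document, sentence, marian])
```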
The last thing I want to cover is that there's also a lot of work that has been done around healthcare: different types of clinical entities and specialized entities, different types of biomedical entities, relation extraction, entity normalization, and a lot of work around OCR, optical character recognition, especially on very noisy and low-quality images. So I'd like to talk a bit about this as I conclude. John Snow Labs has two commercial software libraries that are licensed on top of Spark NLP. One of them is Spark NLP for Healthcare, which provides state-of-the-art, production-grade implementations specific to clinical and biomedical NLP. So there are separate implementations for clinical named entity recognition (it's a different code base), and also things like assertion status detection: telling apart "the patient is diabetic", "the patient is not diabetic", "the patient shows symptoms of diabetes", and "the patient's mother was diabetic". There's also relation extraction, finding relationships between different entities: did this happen before or after this event? Is the patient taking this medication?
Did this symptom start after taking this medication? There are over 30 different types of relationships. So if you're doing any clinical NLP, definitely try it out and see how it works on your own use case, because this really provides state-of-the-art accuracy as well as scalability to the industry. The other licensed library is Spark OCR, which, as the name suggests, actually goes beyond optical character recognition and really does automated understanding of visual documents. It will automatically read and extract textual data from bills, free text or structured PDFs, digital scans and images, and from medical imagery: DICOM images, pathology images, and other formats. It also uses the pipeline model, which is very useful because, as you can see on the transform part of the slide, there is a set of image enhancement filters that increase accuracy, especially if you're dealing with things like old faxes, or photographs taken with a mobile phone where you have folded or skewed pages, shadows, faded paper and stains.
You can see some of the examples here; if you need to extract information from low-quality images or documents, this may be able to help you. So those are just some things to consider. But really, if you want to start, definitely start with the open source library: see if you like the API, how the accuracy works for you, how well it does on your own use case. Beyond that, there are tons of other online resources. We saw the models hub with demos and notebooks, which I just showed you online, but there are also, under the learning hub, the case studies, especially industry case studies that go beyond the nuances of how you train specific models.
There are a lot of lessons learned there about how you actually put those systems into real production cases and scale them, and about the privacy, compliance and responsibility issues you'll face if you're really trying to put a real system in production using these kinds of technologies. So at this point, thank you, and thank you for listening. Here are the two main links: to the live demos and notebooks page, and to the getting started page. Other than that, I'm always happy to listen and learn from what people are doing, so please feel free to reach out to me if there's anything. Thank you.

David Talby

David Talby is a chief technology officer at John Snow Labs, helping healthcare & life science companies put AI to good use. David is the creator of Spark NLP – the world’s most widely used natural language processing library.