The Rise of Vector Data

May 27, 2021 04:25 PM (PT)


Modern Machine Learning (ML) represents everything as vectors, from documents, to videos, to user behavior. This representation makes it possible to accurately search, retrieve, rank, and classify different items by similarity and relevance.

Running real-time applications that rely on large numbers of such high dimensional vectors requires a new kind of data infrastructure. In this talk we will discuss the need for such infrastructure, the algorithmic and engineering challenges in working with vector data at scale, and open problems we still have no adequate solutions for.

Time permitting, I will introduce Pinecone as a solution to some of these challenges.

In this session watch:
Edo Liberty, Founder and CEO, Pinecone


Transcript

Edo Liberty: Hi, my name is Edo Liberty. I'm the founder and CEO of Pinecone, and I want to tell you a little bit about the rise of vector data. To explain what we mean when we say vector data, I want to tell you what happens in our brains when we, people, look at things in the world. At the first step, light enters your eye and hits the retina. The retina is the equivalent of the CCD in your digital camera: those are the neurons that actually sense light and translate it into an electrical signal that travels back to your visual cortex.
The visual cortex, depicted here as this cube or box, is actually made up of several layers, numbered one to six here, and each one of those layers performs a different function. This is an actual neuron, not like the deep learning ones; it is a physical, actual neuron. It sits in the very back of your brain, just above your neck. What it does is convert the direct optical signal of which pixel registered which value into a much more elaborate semantic representation, the neural representation of the image. Let me go level by level. The first layer, V1, the primary visual cortex, connects immediately adjacent pixels together and already starts to understand things like edges. The second layer, V2, already starts to understand things like contours and curvature. The third layer already starts seeing differences in depth and shading and so on.
The deeper you get into the process, the more processed the image becomes and the more the representation diverges from the actual image. I'm telling you this because that representation is what you should think of as the equivalent of the vector representation of data in deep learning. In our brain, once that neural representation is achieved, it's actually transferred to other parts of the brain. Specifically for visual understanding and recognition, it goes to the temporal lobe, where things like face recognition and object recognition happen. I'm telling you this to make a point: everything that you know, everything that you recognize, remember having seen, or identify in the world, the object that your brain actually processes, remembers, and deals with, has very little relation to the actual image that your eyes saw. Rather, it is the vector representation, the very rich semantic neural representation of that image that is the output of V6, the last layer of your visual cortex.
This is incredibly important because this is exactly how neural nets and deep learning process data and understand the world nowadays. We take images and transform them into high dimensional vectors. Think of those vectors as the neural representation, the output of the visual cortex in a computer, except that here it is not a literal visual cortex but a deep learning model that was trained on images, maybe yours, maybe others'. Once we have the vector representation, we can go and perform higher cognitive tasks, in the same way that our brain takes that representation and understands things, remembers them, and identifies them. We do the same things in machine learning: we take those vector representations, often called embeddings, and do object recognition, deduplication, product search in retail, and so on.
This is not limited to images. This is how we deal with text nowadays. Long pieces of text are transformed with natural language processing models, with transformers and LSTMs and so on. Once we have embeddings for text, we translate, we understand, we pick up on sentiment, we do question answering, semantic search, and so on. The same is true for audio and many other sources of data, including user behavior. In fact, this has become the standard way of understanding complex objects in the world in general, and in machine learning and deep learning specifically. This paradigm has gained tremendous traction and is pretty much the de facto standard in a lot of applications. It's used and available through all the big deep learning frameworks like TensorFlow, PyTorch, MXNet, Caffe, Keras, and so on.
You have very well known, readily available pre-trained models for vision like ResNet, AlexNet, VGG, SqueezeNet, DenseNet, Inception, GoogLeNet, MobileNet, and so on. For text, some well known ones are BERT, DistilBERT, Word2vec, GloVe, and so on. For audio, you have Wav2Vec, mxnet-audio, and others. Behind those dot-dot-dots is a very large variety of models and architectures, trained on different data. There are literally thousands of them available for everyone to pick up, and they are effectively the analog of the visual cortex for these other domains. They are also very easy to get: the two lines of Python at the bottom of the slide already download a pre-trained SqueezeNet model, which is a computer vision model we'll see later in the demo.
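The transcript doesn't reproduce the slide's snippet, but with torchvision (one common choice, assumed here) those two lines would look roughly like this:

    import torchvision

    # Downloads pre-trained SqueezeNet weights on first use and builds the model.
    model = torchvision.models.squeezenet1_1(pretrained=True)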
The question that we want to ask, and partially answer, is: what if we save all these vectors? We now know how to convert an image, a piece of audio, or a piece of text into this high dimensional, semantically rich vector object. The question is, how do we move up to the higher cognitive levels, the higher cognitive functions in the brain? How do we move beyond the visual cortex, and what happens if we save millions or billions of these high dimensional vectors? What can we do with that? The answer is that a very rich array of functionalities becomes possible, pretty much any higher cognitive function you can think of.
One of the most immediate things you can do is something called similarity search, which is analogous to identifying objects or finding similar items that you've seen, if we're sticking with the visual analog in the brain. We take those high dimensional vectors, embed them in this high dimensional space, and have a way to retrieve similar items, in this case by distance. You expect similar items to be close by, you can measure distances in different ways, and you expect to be able to actually retrieve similar items, like the shoes at the bottom here. You can identify that they are shoes to begin with, but maybe you can also identify that they're similar in style, similar in color, or similar in function.
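As a concrete illustration of what "retrieve by distance" means, here is a minimal brute-force nearest-neighbor search in NumPy; the corpus size and dimension are made up for the example:

    import numpy as np

    # Toy corpus: 10,000 embeddings of dimension 512 (made-up numbers).
    corpus = np.random.randn(10_000, 512).astype("float32")

    def nearest(query, k=5):
        """Indices of the k corpus vectors closest to `query` in Euclidean distance."""
        distances = np.linalg.norm(corpus - query, axis=1)
        return np.argsort(distances)[:k]

    query = np.random.randn(512).astype("float32")
    print(nearest(query))   # indices of the 5 most similar items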
I want to say that this idea, this mechanism of representing items as high dimensional vectors and retrieving them by similarity, rank, or relevance, is in fact not new. You've already interacted with these technologies. If you've searched on Google or Bing, your queries, especially the longer ones, went through NLP models that embed them into high dimensional vectors and retrieve results based on that. If you've shopped on Amazon or eBay, then your shopping recommendation carousel is almost surely driven by these vendors' embeddings of you yourself and of your shopping cart as complex objects. Plenty of research shows that this ends up increasing both search relevance at Google and Bing and shopping conversion at Amazon, eBay, and other vendors. The same goes if you've searched for music or images on Spotify or Pinterest, or scrolled through your feed or your recommended list of friends on LinkedIn or Facebook.
Again, those items are represented by high dimensional vectors, as embeddings, and ranked and retrieved based on those mechanisms. So I want to ask a question: if this is such an effective mechanism, why isn't it better known? Why isn't everybody doing this? The answer, in my opinion, comes down to the fact that to do this properly you have to use a new kind of infrastructure, a new kind of database. Let me say why I think it's a new kind of database. The main reason is that it fundamentally deals with different objects and different operations on them. Key-value stores deal mostly with pairs of keys and values and with efficient retrieval from that mapping. Document databases, or search engines that deal with documents, really care about mappings of terms, of words, into documents, so that you can do intersections or unions very efficiently and pinpoint a set of documents.
Graph databases deal mostly with relationships and facets of complex objects, and they represent those very efficiently. In a vector database, a database dedicated to these high dimensional vectors, the objects themselves are the high dimensional vectors, the embeddings, and the relationships, the queries you can issue, are geometric in nature: what is close to what, what is within some box or within some distance of something else. These are fundamentally different questions and different objects, and they require their own infrastructure. Let me dive a little deeper into how this infrastructure is different. At the core, you have to index these vectors and be able to retrieve them efficiently. Think of a vector as a long list of numbers. On the left-hand side, you can think of that long list as having length two.
So only two numbers, and you can think of them as the X and Y coordinates of these black dots, and the query is: retrieve everything within the red circle. Intuitively you might say, "I can divide the space up into rectangles a priori; the red circle only intersects the grayed-out rectangles, so I should look only there, and that should be pretty efficient." If you think that, then you're right, and in fact you've just reinvented a very well known algorithm called k-d trees. It's incredibly efficient in two dimensions, but when you deal with neural data, which has hundreds or thousands of dimensions, those algorithms break, and they break in very counter-intuitive ways.
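The talk doesn't name a specific library for this, but SciPy's k-d tree is an easy way to see both sides of the story: range queries are fast in two dimensions, and the same structure degrades toward brute force as the dimension grows.

    import numpy as np
    from scipy.spatial import cKDTree

    # Two dimensions: the "divide space into rectangles" idea works beautifully.
    points_2d = np.random.rand(100_000, 2)
    tree_2d = cKDTree(points_2d)
    inside = tree_2d.query_ball_point([0.5, 0.5], r=0.01)   # everything in the "red circle"

    # Hundreds of dimensions: answers are still correct, but pruning barely
    # helps and each query approaches brute-force cost.
    points_hd = np.random.rand(10_000, 512)
    tree_hd = cKDTree(points_hd)
    distances, indices = tree_hd.query(np.random.rand(512), k=5)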
On the right-hand side, you see a three-dimensional rendering of what is more likely to happen in high dimensional space, where the analog of the red circle is a red cap that contains almost no points, or none at all. In fact, all the rest of the points in your data are very far away. In high dimensional space, that tends to be the case more often than not. It is counter-intuitive if you haven't dealt with the math, but that's the way it is, and it is a testament to the fact that these are actually complex algorithms that are hard to tune correctly. There are tens of open-source libraries dedicated just to doing retrieval from collections of vectors efficiently. Each of those open-source solutions contains at least several algorithms, and each algorithm has many different parameters to play with. Here I'm plotting the trade-off between throughput and what's called recall for different algorithms and different parameter sets.
This is taken from ann-benchmarks, a project dedicated to benchmarking these algorithms against one another. Note that the four graphs actually show four different datasets. Not only are these algorithms complex, with many parameters to tune; which one is better also depends heavily on your specific dataset.
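Recall here is simply the fraction of the true nearest neighbors that the approximate index manages to return. A tiny illustrative definition (not taken from the benchmark code itself):

    def recall_at_k(approx_ids, exact_ids, k=10):
        """Fraction of the exact k nearest neighbors found by the approximate search."""
        return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

    # e.g. recall_at_k(ann_result, brute_force_result, k=10) == 0.9 means
    # 9 of the true top-10 neighbors were returned.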
Another level of complication is that even if, for your application, you've figured out the algorithm, the parameters, and the software to use, you're still very far from being able to actually use this. The first set of challenges has to do with scale and functionality: sharding and replication for the size of the data and the throughput you might be seeing, and live updates. When you update an embedding for an item, you want it to be immediately searchable and immediately reflected in future searches. You need to add features like namespacing and filtering, so that you respond with only an appropriate set of results for every query. You have to deal with pre- and post-processing with models: now that we're using embeddings, ETL and pre- and post-processing in databases become model applications, and you have to handle that complication too. It doesn't end there, but this is a good list, so maybe we can stop here. There is also a whole other set of challenges that has to do with production readiness: high availability, persistence, consistency, monitoring, alerting, and support, centralizing this in your org, running the entire thing on Kubernetes deployments, being able to spin environments up and down, and so on.
All of that makes building such an infrastructure in-house incredibly challenging, and it explains why the hyperscalers and maybe the Fortune 50 tech companies were able to pull it off and take advantage of these tools, while medium and smaller companies just could not justify the investment. At Pinecone, we wanted to take on this entire effort. We wanted to create a vector database, to offer similarity search truly as a service. We asked ourselves, "Can we provide that entire set of capabilities as a managed service, consumable through a simple API?" We've done just that. We launched Pinecone a couple of months ago, and it is now available for you to go and use. You can create services and applications and scale up and down, with all the benefits of both the efficiency and versatility of the algorithms on the one hand, and production readiness and scale on the other. Now we'll switch to a demo to see how that works.

Speaker 2: Now we'll see a demo of how to build an image search application with Pinecone. The idea is to take a large collection of images, here on the right, take a query image, maybe this seagull, and look for other gulls or other birds in your set of images. To do that, I'll go to Pinecone's website, put in my first name, last name, and email, and get an API key. I've already done that, so I don't need to do it again. Once you press Get API Key, you'll get an email within a few minutes, and you'll be able to use that key with the service. I can scroll down to start the image search example, or I can find it in the Docs, in the Examples section or in the Learn section; you can also just go off the main website.
You'll just click on Image Search. You have it in blog form if you're interested, or you can click on the Colab link and open it as a Google Colab that you can run yourself after you put in your API key. Here, I'm actually switching to the Google AI Platform and running it on my own GCP instance with a managed notebook. It's the exact same notebook you would be running in your Colab environment. What we'll see here is an eight step process: we'll install some dependencies, download a dataset called Tiny ImageNet, download a pre-trained computer vision model for transforming images to vectors, use that model to actually transform the images to vectors, create an index, upload all the images in vector form to Pinecone, perform some queries, look at the results, and then tear down the index. Without further ado, let's install some dependencies. I'm installing the Python client, upgrading my pip installer just in case, and installing torchvision, pandas, and matplotlib since we'll be showing some plots. That is done. Here, I have to put in my API key, which I've already copied.
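The notebook cells themselves aren't reproduced in the transcript; a minimal sketch of the install-and-initialize step might look like this (package names and the init signature vary by client version, and YOUR_API_KEY is a placeholder):

    # In the notebook the dependencies are installed first, roughly:
    #   !pip install --upgrade pip
    #   !pip install pinecone-client torchvision pandas matplotlib

    import pinecone

    # The API key arrives by email after signing up on pinecone.io.
    pinecone.init(api_key="YOUR_API_KEY")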
Now I have imported pinecone and initialized my client, and I can run as an authorized user. I'm just confirming that the versions match; you don't have to do that every time. Here, I'm downloading Tiny ImageNet. It's not a very large dataset and the GCP connection is very fast. I think I've actually already pre-downloaded it to this instance, so this is pretty quick; it just checks that the files are there, but it usually takes maybe 10, 20, 30 seconds. Now that I have the data, I can use torchvision. It has a nice utility that lets me iterate over the folder structure and create a flat list of file names and labels based on that structure.
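A sketch of that step, assuming Tiny ImageNet was unpacked into a tiny-imagenet-200/train folder (the exact path is an assumption):

    import torchvision

    # Tiny ImageNet keeps one sub-folder per class; ImageFolder walks the tree
    # and exposes a flat list of (file_path, class_index) samples.
    dataset = torchvision.datasets.ImageFolder("tiny-imagenet-200/train")
    samples = dataset.samples
    print(len(samples), "images across", len(dataset.classes), "classes")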
Now we can look at those images. These are just utility functions that print images to the screen, and here is a sample not unlike the one we saw before, just a collection of images: some food, a basketball, some corals, and so on. This is a sample from our dataset. Maybe I did not mention it: the dataset is 100,000 training images across 200 classes, 500 images per class. Now we need to create a model that transforms images into their embeddings, the vectors. For that we're using PyTorch; specifically, we use PyTorch to download a pre-trained version of SqueezeNet. For each image we'll perform some rudimentary normalization and then apply the model. This cell downloaded the model and created a class that is able to transform images.
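The notebook's exact wrapper class isn't shown in the transcript; a minimal sketch of the same idea, a pre-trained SqueezeNet plus rudimentary normalization, could look like this:

    import torch
    import torchvision
    from torchvision import transforms

    # Pre-trained SqueezeNet in inference mode.
    model = torchvision.models.squeezenet1_1(pretrained=True)
    model.eval()

    # Resize, convert to tensor, and normalize with the ImageNet statistics
    # that the pre-trained weights expect.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def image_to_vector(pil_image):
        """Turn one PIL image into a flat embedding vector (length 1000 for SqueezeNet)."""
        batch = preprocess(pil_image).unsqueeze(0)   # shape (1, 3, 224, 224)
        with torch.no_grad():
            features = model(batch)                  # shape (1, 1000)
        return features.squeeze(0).numpy()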
Now I need to actually transform the images, and that takes maybe a minute or two. Note that this instance now has to apply the model to every image in my data and create its vector embedding. I'll note that it's possible to give Pinecone the models themselves and have Pinecone take charge of this step, but that's more involved and I think it's outside the scope of this demo; the main point is already shown by converting the images to vectors ourselves and uploading the actual vectors to Pinecone. We should be done with this fairly quickly. We're only using 2,500 images for this example. While this is running: what we'll do next is split the data 90-10. What is it, 2,250? Am I doing this correctly?
So 2,250 images are going to be in my corpus, these are going to be my data, and I'm going to use 250 images as queries. That's what this split does. We'll run that in a second, and then we'll create an index. Here, I'm just naming my index; this is the first time I'm actually using Pinecone. I'm creating a string that is the name of my index, and I make sure that the index doesn't already exist, because trying to create an index that already exists will throw an error. Then I create the index: I give Pinecone the name of the index and tell it to use Euclidean distance as the measure of similarity, so closer means more similar. It's a relatively small index, so I can use one shard. You can use more shards if your data is larger, and you can use replication if you need higher QPS and so on; we're not going to go into that here. This cell is now finished, so I can run this and just look at my data frame. My IDs are simply the names of the files on my local hard drive, and the embeddings are these vectors. Now I can split the data and create the Pinecone service. This should take only a few seconds.
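Argument names have changed across Pinecone client versions, so this is only a sketch of the create-index step rather than the notebook's exact cell; the index name is a placeholder, and the dimension matches the SqueezeNet output assumed in the sketch above:

    import pinecone

    index_name = "image-search-demo"   # any string; just the name of the index

    # Creating an index that already exists raises an error, so clean up first.
    if index_name in pinecone.list_indexes():
        pinecone.delete_index(index_name)

    # Euclidean distance as the similarity measure (closer means more similar);
    # one shard is plenty for ~2,250 vectors.
    pinecone.create_index(index_name, dimension=1000, metric="euclidean", shards=1)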
All right. Now we have a managed service, a vector index, on the Pinecone side. This is a remote service, of course; it isn't running on this machine. To connect to it, I create a connection, which I'll just call index. Mind you, this index object is a connection, a channel; it isn't the index itself. Here I'm upserting: I'm calling index.upsert, and what it gets is a list of tuples, with the ID as the first item in each tuple and the vector as the second, and this just iterates over those. Here I'm running the upsert. Note that it was incredibly fast: Pinecone already ingested and indexed 2,250 vector embeddings corresponding to my images, together with their IDs.
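A sketch of the connect-and-upsert step, assuming `ids` (the file names) and `vectors` (the SqueezeNet embeddings) were built in the previous steps, and a client version whose upsert accepts (id, vector) tuples:

    import pinecone

    # The Index object is a connection to the remote index, not the data itself.
    index = pinecone.Index(index_name)

    items = list(zip(ids, vectors))          # [(id_0, vec_0), (id_1, vec_1), ...]
    for start in range(0, len(items), 500):  # send the 2,250 items in a few batches
        index.upsert(vectors=items[start:start + 500])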
Now I have 250 queries that I'd like to issue. I'm going to run those queries in batches of size 100, so that is three query requests, and I'll time it so we can see what we get. Here I'm getting a throughput of about a thousand queries per second. I want to stress that this is highly dependent on the network you're operating from. If you're somewhere random in the cloud, it might be slightly lower or slightly higher; if you're operating in a Google Colab or some less performant environment, it might be lower; and on your laptop at home, this would be quite significantly lower. Most services actually issue what are called unary calls: they run queries one by one, and that is of course significantly slower because it isn't batched, so there is a round trip of communication for every query. We see that we only get about 50 queries per second in that setting.
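A sketch of the batched versus unary query timing, assuming `queries` holds the 250 held-out embeddings and a client version where Index.query accepts a list of query vectors:

    import time

    # Batched: 250 queries in three requests of up to 100 each.
    start = time.time()
    for i in range(0, len(queries), 100):
        batch = [vector.tolist() for vector in queries[i:i + 100]]
        results = index.query(queries=batch, top_k=10)
    print(f"batched: {len(queries) / (time.time() - start):.0f} queries/second")

    # Unary: one round trip per query, noticeably slower.
    start = time.time()
    for vector in queries:
        result = index.query(queries=[vector.tolist()], top_k=10)
    print(f"unary:   {len(queries) / (time.time() - start):.0f} queries/second")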
Finally, it's time to actually look at the results. To do that, we'll just print the relevant queries and then the images corresponding to each result set. For gulls, we get mainly gulls. For corals, we get mostly corals. Here are jellyfish, food, et cetera. I'm sure I can find somewhere where this isn't perfect; yeah, here, for something frying in a pan, you get what seems to be [inaudible] a bird, or is it alive, I don't know. Anyway, these are results for image search. Note that some of these images, this image of a gull for example, the first one, is very different from this last image of a gull. The comparison isn't done pixel-wise; it's done at a much deeper, semantic level.
Finally, when I'm done with my service… Obviously, if this were running in production, I would leave it on, but since this is just a tutorial and I'm done, I can delete the index and free all the resources associated with it. Once I've done that, I'm not using any resources on the Pinecone side and there's no service waiting to answer queries. That is it. Thank you very much. I suggest you go to pinecone.io and look at the examples and tutorials; I'm sure you'll find something you'd like to try. Thank you so much.

Edo Liberty: Again, thank you. My name is Edo Liberty. Please reach out to me at pinecone.io with questions and ideas. We’d love to hear them, and I believe we switch over to questions from the audience.

Edo Liberty

I'm the Founder and CEO of Pinecone, the vector database for machine learning. Until April 2019, I was a Director of Research at AWS and Head of Amazon AI Labs. The Lab built cutting-edge machine l...