Everyday Probabilistic Data Structures for Humans


Processing large amounts of data for analytical or business cases is a daily occurrence for Apache Spark users. Cost, latency and accuracy are the three sides of a triangle that a product owner has to trade off. When dealing with TBs of data a day and PBs of data overall, even small efficiencies have a major impact on the bottom line. This talk covers practical applications of the following four data structures, which help in designing an efficient large-scale data pipeline while keeping costs in check.

  1. Bloom Filters
  2. HyperLogLog
  3. Count-Min Sketches
  4. T-digests (Bonus)

We will take the fictional example of an eCommerce company, Rainforest Inc, and try to answer the following business questions with our probabilistic data structures and Apache Spark, without doing any SQL.

  1. Has user John seen an ad for this product yet?
  2. How many unique users bought items A, B and C?
  3. Who are the top sellers today?
  4. What’s the 90th percentile of the cart prices? (Bonus)

We will dive into how each of these data structures is calculated for Rainforest Inc and see which operations and libraries help us achieve our results. The session will simulate a TB of data in a notebook (streaming) and will have code samples showing effective use of the techniques described to answer the business questions listed above. For the implementation part, we will implement the functions as Structured Streaming Scala components and serialize the results so they can be queried separately to answer our questions. We will also present the cost and latency efficiencies achieved at the Adobe Experience Platform, running at PB scale, by using these techniques, to show that this goes beyond theory.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hey guys, this is Yeshwanth Vijayakumar. I’m Project Lead at Adobe Experience Platform for the Unified Profile Team.

Everyday Probabilistic Data Structures for Humans

Today we’re gonna talk about Everyday Probabilistic Data Structures for Humans. There’s a specific emphasis on the human part; you’ll see why.

Today’s goals: I wanna keep it as simple as possible. The first one is to add some interesting tools to your data processing belt. The second is to introduce you to them. I also want to show you how to use the tools and apply them. What’s not in scope is the internals of the data structures. There are way better resources than me, by people who know what they’re doing and what they’re talking about, so I’m gonna leave it to them. This is gonna be mainly from an application point of view, so that at the end of this talk you feel like trying this out on your own.

Daily Trade Offs

As part of our daily engineering lives, I would say the three pillars that you constantly keep looking at are cost, latency and accuracy. How are we trading off these three things in order to get to our final goal? That is kind of the daily theme of our lives.

When I said for humans, we need to make sure that this resonates well with you. So we’re gonna take an e-commerce company as an example; let’s take Rainforest Inc.

Simplified Example Workflow

We’re gonna look at a simplified event workflow that they have. A user logs into the page, visits a product page, adds some stuff to a cart and then purchases it, and we see events get generated for all of this. Now, if we were to look at what such an event would consist of, the significant fields would be the productId of the product, obviously. Then the eventType, which would be an enumeration of, say, PageVisit, AddToCart and Purchase. Then the userId of whoever is clicking into these pages and taking these actions. Then the totalPrice, the sellerId to indicate who’s actually selling this, and also the ipAddress.

Now, the scale of events, just to give some context on why we make certain design decisions throughout the talk.

You can skip everything else, but the main thing I would pay attention to is the size of the daily events flowing into our system: about one TB of data. A lot of you right now would be like, hey, one TB is not a lot. But compound that with other factors: say you’re going through 30 days of data, and that’s 30 TB. If you’re making repeated queries over data that’s constantly streaming in, you’re gonna see this add up pretty quickly.

Some Interesting Questions

So, now that we framed a hypothetical company, let’s also frame some hypothetical product requirements. The first thing is, has a user visited this product yet? The second question would be, how many unique users bought items, A, B or C? And then the third question will be, how many items has seller X sold today?

Simple enough. So what we’re gonna do is take these three questions as our base and use certain data structures to answer them in a cost- and latency-effective way.

What are we going to trade off?

So what are we going to trade off? We’re gonna give up a bit of accuracy and go very heavily on reducing costs and reducing the latency of answering our questions.

So, getting into probabilistic data structures, the three we want to look at today are Bloom Filters, HyperLogLogs and Count-Min Sketches.

Which ones are we going to explore?

We’re gonna go kind of like semi deeply into these, and we’re going to focus mainly on like the implementation part and the end to end design part as in how to use these.

So let’s start off with Bloom Filters.

With respect to Bloom Filters, like I said, we’re not gonna go too deeply into how Bloom Filters work, the internals or whatever.

Bloom Filter TL;DR

In all of my slides, the bullets are the talking points; even if you don’t listen to anything else, just follow these points and I think you will be able to use them at least once somewhere. So, a Bloom Filter mainly answers the set membership question in a probabilistic way. I would look at it as a replacement for, say, a normal set.exists(key), if you’re looking for a one-to-one comparison. The main question it translates to is: is the element that I’m looking for possibly in this set? Note the emphasis on the “possibly”. If the answer is yes, what’s the probability that the answer is wrong? If the answer is no, you are guaranteed that you definitely don’t have it in there. And the last important thing, which will make a lot more sense in the next few slides, is loss-free unions. That is, we are able to combine Bloom Filters in a loss-free way.

So, given that, we’re gonna sidestep into abstract algebra, where we’re going to go through a category of structures called monoids. You might be thinking: we were just talking about a Bloom Filter, so why are we jumping to monoids? Trust me, it’s gonna make sense as soon as we finish this train of thought. The way I would look at a monoid is this: you have a set M and an operation OP defined on it, and we look at three properties. The first is closure: when you perform the operation on two elements of the set, does the result still belong to the same set? For example, string a plus string b gives string ab, which is again a string. The next one is associativity, which tells you that the grouping of the operations does not matter. String a plus b plus c, no matter how you group it, gives abc. And the last important one is the identity: an element such that if you perform the operation with it, you still end up with the same thing. For strings, that is the empty string: any string a plus an empty string, no matter which side it comes from, is still string a.
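The three monoid properties can be written down in a few lines. This is a minimal Python sketch, not code from the talk; the `Monoid` class and the `string_concat` name are made up for illustration, and string concatenation is used because it is the example in the transcript.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """A set of values with an associative operation and an identity element."""
    zero: T                      # identity element
    op: Callable[[T, T], T]      # closed, associative binary operation

    def fold(self, values: Iterable[T]) -> T:
        # Folding an empty input simply returns the identity.
        return reduce(self.op, values, self.zero)


# String concatenation, the example used in the talk.
string_concat = Monoid(zero="", op=lambda a, b: a + b)

# Closure: combining two strings gives another string.
ab = string_concat.op("a", "b")

# Associativity: the grouping does not change the result.
left = string_concat.op(string_concat.op("a", "b"), "c")
right = string_concat.op("a", string_concat.op("b", "c"))

# Identity: the empty string changes nothing, on either side.
same_right = string_concat.op("a", string_concat.zero)
same_left = string_concat.op(string_concat.zero, "a")
```

Anything that fits this shape can be folded in chunks and the chunk results folded again, which is the property the rest of the talk relies on.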

The Map/Reduce Boundary – Shuffle And why having a Monoid is neat!

Okay, so now that we’re done with monoids, let’s put monoids together with Spark before we get back to the Bloom Filter. Look at the map/reduce boundary, basically the shuffle. Aggregate functions in Spark, like sum and count, trigger a shuffle in order to move data that is locally aggregated on one node across nodes for the global computation. So each executor produces local aggregates, and then the local aggregates get put together by the shuffle process to get the final computed aggregate. Now, if you put this together with the monoid properties, it makes a lot more sense, because the properties of a monoid are exactly what you need to define a very simple aggregation function. In other words, if you have a data structure which is a monoid, you can easily convert it into a Spark aggregation function. Keep that in mind.

Forget the entire wall of text over here. I went ahead and wrote my own Bloom Filter aggregator, not a big thing, but if you focus on the first part, which is the zero, you can see that this is the identity element, where we are just creating an empty Bloom Filter. In the second part, the merge step, you take two Bloom Filters, b1 and b2, put them together, and get another Bloom Filter. What helps here is the associativity and the closure property. So now we’ve seen exactly how monoids help in aggregations and how we’re able to express a Bloom Filter as a Spark aggregate.
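To make the zero and merge steps concrete, here is a toy Bloom filter in plain Python. It is a sketch under stated assumptions, not the aggregator shown on the slide: the class, the bit and hash counts, and the salted SHA-256 hashing are all illustrative choices. It shows that merging per-partition filters gives the same bits as building one filter over all the data.

```python
import hashlib
from functools import reduce


class BloomFilter:
    """Toy Bloom filter: num_bits bits stored in one Python int."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3, bits: int = 0):
        self.num_bits = num_bits      # illustrative size, not tuned
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other: "BloomFilter") -> "BloomFilter":
        # Union is a bitwise OR; the result is again a Bloom filter (closure).
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)


def zero() -> BloomFilter:
    """Identity element: an empty filter."""
    return BloomFilter()


def build(users) -> BloomFilter:
    bf = zero()
    for user in users:
        bf.add(user)
    return bf


# Three "partitions" of user ids, one of them empty.
partitions = [["john", "mary"], ["ravi"], []]
local_filters = [build(p) for p in partitions]          # local aggregates
merged = reduce(BloomFilter.merge, local_filters, zero())  # global aggregate
direct = build(user for p in partitions for user in p)
```

Because OR is associative and the empty filter is its identity, `merged.bits` equals `direct.bits` regardless of how the data was partitioned.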

BloomFilters as Aggregate Functions

It becomes trivial to express it as an aggregate function to use on our DataFrames. But the good thing is that Spark already took care of it: the DataFrame stat functions already have a Bloom Filter implementation. So all I need to do is call bloomFilter on whatever column I want, with the extra settings, and I can create a Bloom Filter for it, no worries.

So next, how exactly do we use this Bloom Filter to answer one of the questions that we have?

Solve it incrementally!

But before we get to that, I just want you to pay attention to this slide. While there are a lot of examples showing Bloom Filters, HyperLogLogs and Count-Min Sketches, they all show it in a batch fashion. Sure, you have a bunch of data residing in a data lake, you load it into a DataFrame, you create some filters out of it, and you’re done. But a lot of the time in real life that doesn’t exactly help, because we have data streaming into our systems all the time. Take the Rainforest Inc case: you have data flowing in all the time, and you could keep orchestrating a batch process that keeps updating these Bloom Filters. But at some point we want a neat framework to do this incrementally, so that we don’t need to scan all the data again and again. So what we’re gonna do is use Structured Streaming. We’re going to create our sketches, the probabilistic data structures, in the executors, and write them to an external store constantly. For every microbatch, say every 30 seconds or so, we’re gonna gather the data and write it to an external store, and then the apps, and you can have any number of apps, reach out to the external data store and consume these probabilistic data structures to answer your questions. So, with that set in stone, let’s get into actually answering some questions.

Has User visited this product yet?

So has a user visited this product yet?

Before we get to the code, let’s do a small step-through of the ingestion workflow in pseudocode. For every ingestion microbatch, say our trigger interval is 30 seconds, so every 30 seconds you’re gonna get a DataFrame with a bunch of rows. What we end up doing is creating one Bloom Filter for every product. So if you have p1, you’ll have a p1-BloomFilter created for it, and so on.

In the map step, we use the productId as the key, and the value emitted is the Bloom Filter. In the reduce step, we combine all the Bloom Filters from each partition into one. And in the foreachBatch step, we update it in the external store. At this point, I wanna switch over to a demo that I have. I have a notebook where I have defined the event class.
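The map, reduce and foreachBatch steps can be simulated without Spark. The sketch below is a hypothetical stand-in, not the notebook code: a plain dict plays the external store, each filter is just an integer bitmask, and the `<productId>-BloomFilter` key format and the helper names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

NUM_BITS, NUM_HASHES = 1024, 3   # illustrative sizes, not tuned


def bit_mask(item: str) -> int:
    """Bits a toy Bloom filter would set for one item."""
    mask = 0
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return mask


def might_contain(bits: int, item: str) -> bool:
    mask = bit_mask(item)
    return (bits & mask) == mask


def process_micro_batch(batch, store) -> None:
    # "Map": key each event by productId and emit a one-user filter.
    # "Reduce": OR together the filters that share a productId.
    per_product = defaultdict(int)
    for event in batch:
        per_product[event["productId"]] |= bit_mask(event["userId"])

    # "foreachBatch": merge each batch-local filter into the external store.
    # A missing key is treated as the empty filter (the zero element).
    for product_id, bits in per_product.items():
        key = f"{product_id}-BloomFilter"
        store[key] = store.get(key, 0) | bits


store = {}  # stand-in for the external store
process_micro_batch(
    [{"productId": "p1", "userId": "john"}, {"productId": "p2", "userId": "mary"}],
    store,
)
process_micro_batch([{"productId": "p1", "userId": "ravi"}], store)

john_saw_p1 = might_contain(store["p1-BloomFilter"], "john")
```

Each microbatch only touches its own events; the stored filter accumulates across batches through the merge, so no historical data is rescanned.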

Event Record Definition

So, our Event class has a productId, eventType, userId, blah, blah, blah. You can take a look at an example record, like product 5610, addToCart, a user, and so on. I went ahead and generated some synthetic data already; we’re not gonna go through all of that right now. Also, the external data store that we’re gonna use for all our demos, that connection has been set up already. If you’re familiar with the Spark pain of encoders and decoders, this will feel very familiar to you; it’s for Kryo.

To simulate the data as a stream, I already wrote it into a file. I’m just gonna read it as an event stream, and that’s what we’re gonna run off of. Getting back to creating the Bloom Filter, if you go line by line, what we see is that for every batch in the event stream, we get a handle on the batch DataFrame and the batch ID. First, we pull out the list of unique products in the given batch. So for every microbatch we have, say, p1, p2 and p3. Given the list of unique products, we know how many Bloom Filters we’re going to create for this microbatch. For each one, to keep things very simple for the demo, we’re gonna use the Spark Bloom Filter.

So, once we know which product it is, we create a split for that product and create a Bloom Filter for that product. Then we have a convenience function here called update Bloom Filter, which updates it for every product, and this is the key structure that we use for the Bloom Filter. We’re gonna take a closer look at the update Bloom Filter part, because that’s gonna tell us exactly how the monoid properties help us out.

So, if you look at update Bloom Filter, you can see that it takes a new Bloom Filter and a key. First, we check whether there is already a Bloom Filter in Redis.

If there is, we fetch it and merge it in place with the new Bloom Filter.

And if the Bloom Filter does not exist, it’s much simpler: we just put the new one in. So this would be the zero property, and the merge would be the associativity and closure properties. We are just using Redis as our external shuffle mechanism, in a way, to make this more efficient.

Since this is a presentation, I didn’t want to start the stream right now; I already did it ahead of time. So what we’re going to do right now is query the Bloom Filter, since we already populated it. Let’s see what question we had. If you look at the entire data, say for productId 8788, and I filter for that, I see that we have a good number of entries, and I can see that a good number of users have gone through it. Now, if I want to check whether a particular user has visited this product, all I need to do is fetch the Bloom Filter for this product, which is again a convenience function, and call mightContain on top of it. Oh, looks like this just got shut down. But no worries, we ran it some time back, so the results are pretty fresh. We can see that the result is true, that this user is in there.

The key thing to note here is the timing. This took only 0.16 seconds, versus the DataFrame query taking 1.79 seconds. As we get more and more data, the DataFrame query time will increase linearly. No matter how many indexes or whatever you put in, if your queries have a full-table-scan kind of nature, this is going to increase linearly. But the Bloom Filter query is going to remain constant, because the actual operation is a fetch from Redis for the Bloom Filter and then an operation on top of the Bloom Filter. So this is quite efficient in that way. This should give you a good idea; it basically translates to a set.exists.

So we’re gonna go over HyperLogLogs now and see how they help us answer the next question. Again, taking a 10,000-foot overview of HyperLogLog, the main question it answers is: how many distinct elements do you have in the set? This is analogous to set.count, so that you can see how many unique elements there are.

The great thing about the HLL is that you can estimate amazingly large cardinalities with very good accuracy using just 1.5 KB of memory. Of course, you can put a lot of different features and flags on top of it and trade off accuracy versus space. The other neat thing is loss-free unions, like we saw for the Bloom Filter, and again it’s a monoid. So all the monoid theory that we heard in the beginning translates over here, and you can see why HLL becomes a very central aggregate function in Spark, or in any distributed computation for that matter.

And I know a good number of people in the audience are gonna be like, okay, wait, they both look pretty similar. You’ve got a set as the common thing here. What’s the difference, and when do I use one versus the other? Without getting into too much detail, I would put it into two blanket rules. For cardinality estimation, HLLs are better as you keep adding more elements. This blog over here has some very good plots showing that. For membership testing, use a Bloom Filter; it’s just super simple.

So now, getting to the HLL as a monoid. In terms of supported operations, you add a hash to an HLL and you still end up with another HLL of 1.5 KB. The merge operation, again, is very characteristic of the monoid, with the cardinality operation being specific to the HLL itself.
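The add, merge and cardinality operations can be illustrated with a compact HyperLogLog in plain Python. This is a simplified sketch, not the Redis implementation used in the demo: the class name, the precision of 12 and the SHA-256 hashing are assumptions, and the estimator is the textbook formula with linear counting for small counts.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**precision registers."""

    def __init__(self, precision: int = 12):
        self.p = precision                 # illustrative value
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Register-wise max: lossless union, result is the same size.
        assert self.p == other.p
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # linear counting
        return raw


# Two overlapping groups of hypothetical buyers.
bought_a = HyperLogLog()
bought_b = HyperLogLog()
for i in range(6000):
    bought_a.add(f"user-{i}")
for i in range(4000, 10000):
    bought_b.add(f"user-{i}")

union = bought_a.merge(bought_b)
print(round(union.cardinality()))   # an estimate close to the 10,000 distinct users
```

The merged sketch has exactly the registers it would have had if every user had been added to a single HLL, which is what makes the union loss free.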

How many unique users bought Items A, Ingestion Workflow

So, getting to the question that we want to answer: how many unique users bought items A, B and C? If you again go through the ingestion workflow, you’ll see that on the ingestion microbatch, similar to what we did for the Bloom Filter, we’re going to create an HLL for every productId. And this is where we start deviating slightly from the previous approach, where we were doing a good amount of the computing, the Bloom Filter creation, within Spark. To keep things fresh, we’re not gonna do the data structure population inside Spark. We’re actually gonna take it outside, just to expose you guys to another backend. So in the foreachBatch, we group by product and do a local aggregation of the users. Then we update an external store, which in our case is Redis, and we have the HLL as an actual data structure inside Redis. That’s one cool thing that I like about Redis. I’m a huge Redis fan, so you’re gonna hear me say that all the time. So, enough talking; let’s get to the second part of our demo, the HLL part.

So, similarly for HLL, we go over the event stream, and for each batch we get a handle on the microbatch DataFrame, which has the events that happened in that microbatch’s time interval. Once we have that, we get the mapping of products to users. Basically it’s a group by product, and we make sure we filter only for the purchases. So the final output is the mapping of productId to user one, user two, up to user N: all the people who actually purchased it in that microbatch. Now that we have this, for each product in this derived DataFrame, we push to Redis as an HLL, a purchase HLL for each product. So product one would have a product-one-purchase HLL with the list of users who purchased it in that microbatch. The key difference here is that we are directly pushing it to Redis. Some keen observers will notice that, since we have a local aggregation per product for this microbatch, this is equivalent to a combiner step. We already have the combiner output here, so all we do is push it to Redis. Now, the key thing here is that, again, the monoid comes into play, because you could have multiple threads pushing data for the same product.

But it doesn’t matter, because even if you have three operations for the same product getting queued up on Redis, since it’s single threaded, the associativity property of the HLL kicks in, and no matter what order the operations happen in, the end result of the HLL is the same. So that again is how being a monoid helps us scale this out. What we effectively have is that, as the data comes in in real time, we are updating our HyperLogLogs for each product with the list of users who actually bought the product.

Again, similar to the Bloom Filter, I already populated it ahead of time. First, let’s take a look at the actual data, to set the context for comparing with what we’re going to query. If you look at the full purchase DataFrame, where we’re grouping by product and looking only at purchases, you get a list like this: every product and the list of unique users who have bought that product. So this is the actual data. Now we’re gonna use the PFCOUNT command to access a given product’s HyperLogLog and see how many entries are actually in there. That’s the bread-and-butter use case for the HyperLogLog, cardinality estimation. You can see that this has five, so cool, we got an answer, this product exists. And as you can see for 1619: one, two, three, four, five, okay. So far so good. Accuracy is not being traded off yet, so that’s fine. But the actual question that we wanted answered is the total set of users who bought A, B and C.

So what we’re gonna do is perform a union operation on the HLLs, because we already saw the merge operation. We’re gonna look up the HLLs for products 1619, 4944 and 3939, combine them, and see what the total count is. We get a count of 21. Now, if we query the DataFrame for only these specific products, what happens? I already did that ahead of time, so let’s look at the count. That’s 22, unless my math is going really bad. So this is 22, but we got 21. The key thing to note here is that we’re looking at a union. If you look at this example, products 4944 and 3939 have a common user, and we can’t double count that user in the union. So we actually got the right answer over here; good stuff. And again, the timings are similar to the Bloom Filter’s.
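The gap between 22 and 21 in the demo is the difference between adding up per-product counts and counting a union. A small worked example with exact Python sets shows the same effect; the product and user names are hypothetical, not the demo data.

```python
# Hypothetical purchase data: product -> users who bought it.
purchases = {
    "A": {"u1", "u2", "u3"},
    "B": {"u3", "u4"},
    "C": {"u5"},
}

# Adding per-product counts counts u3 twice (once for A, once for B).
sum_of_counts = sum(len(users) for users in purchases.values())   # 6

# A union counts each user once, which is what merging HLLs estimates.
unique_buyers = len(set().union(*purchases.values()))             # 5

double_counted = sum_of_counts - unique_buyers                    # 1
```

An HLL merge approximates the union count, so a result one lower than the summed DataFrame counts is what one shared buyer would produce.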

This took 1.68 seconds after we had already generated this DataFrame, which initially took 35 seconds because of the group by and all that stuff. But the HLL query took a constant 1.14 seconds. I can guarantee you that no matter how many more entries you add, as long as our configuration is right, we’re gonna get an answer in around the same ballpark. But the other time, 1.68 plus roughly 35, so something like 37 to 40 seconds, is gonna keep increasing as you look at more and more data over the days. So again, we now have a nice data structure that we are updating in real time, and we’re able to get answers in real time by trading off a bit of accuracy. There is a good amount of compression here.

Now let’s switch over to the Count-Min Sketch section to try to answer the last question that we have. For Count-Min Sketch, again at a 10,000-foot overview, I would look at it as a space-efficient frequency table or, in very plain terms, a hash table replacement. Again, I’m really dumbing this down, so I do apologize if people see this as too much of a generalization.

But yeah, the next thing is, wait, space efficient? What do you mean by space efficient? It takes sub-linear space instead of O(n). Good for us. It might lead to some overcounting, but that’s fine; we are saving a good amount of space here. Count-Min Sketches are technically a logical extension of Bloom Filters when you start digging into the internals. The cool thing is that it’s a monoid. Again, I can’t emphasize enough how important its being a monoid is for our use case, specifically with Spark.
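A toy Count-Min Sketch makes the frequency-table idea, the overcounting and the merge visible. This is a plain-Python sketch under assumptions, not Spark's implementation: the class name, the width and depth, and the seller ids are illustrative.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: depth rows of width counters each."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width, self.depth = width, depth     # illustrative sizes
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item: str):
        # One salted hash per row picks the counter for this item in that row.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever add, so the minimum across rows is the
        # tightest estimate and is never below the true count.
        return min(self.table[row][col] for row, col in self._columns(item))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        # Cell-wise addition: the result is again a sketch of the same shape.
        assert (self.width, self.depth) == (other.width, other.depth)
        out = CountMinSketch(self.width, self.depth)
        out.table = [
            [a + b for a, b in zip(r1, r2)]
            for r1, r2 in zip(self.table, other.table)
        ]
        return out


# Two executors each see part of the day's purchase events.
executor_1 = CountMinSketch()
for seller in ["seller-675", "seller-675", "seller-675", "seller-1169"]:
    executor_1.add(seller)

executor_2 = CountMinSketch()
for seller in ["seller-675", "seller-675"]:
    executor_2.add(seller)

purchases_today = executor_1.merge(executor_2)
sold_by_675 = purchases_today.estimate("seller-675")   # at least 5, the true count
total_events = sum(purchases_today.table[0])            # every row sums to 6
```

The table size is fixed up front, so memory does not grow with the number of sellers; the price is that hash collisions can inflate an estimate, never deflate it.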

How many items has seller X sold today?

So, now that we’ve seen that a Count-Min Sketch is a frequency table or hash table replacement, let’s try to answer the question: how many items has seller X sold today? This is nothing but a frequency question. If you were writing a SQL query, you would select from the table, group by seller, and get a count per sellerId. Following the same workflow, the same end-to-end design that we have, let’s look at the ingestion workflow for this. For the ingestion microbatch, what we’re gonna do is create a Count-Min Sketch of the seller counts for the data.

But the key thing is that we’re gonna create a Count-Min Sketch for every eventType. Note that the question said: how many items has seller X sold today? Each seller, because of the products associated with it, could have three different types of events: a pageVisit, an addToCart and a purchase. So if we want seller-wise statistics, why not just generate a Count-Min Sketch per eventType (mumbles).

Once we generate that, the reduce step becomes very simple: we just merge in place the list of Count-Min Sketches we get from each executor. And then in the foreachBatch, just like what we did for the Bloom Filter, we update an external store so that it can be served to and consumed by other users. Let’s switch back to the demo to see how it actually works.

Count-Min Sketch

So, for the Count-Min Sketch, we again created a helper method to update Redis, but first let’s look at the boilerplate code that we have. This is exactly like the Bloom Filter code.

The only difference is that instead of doing it per product, I’m doing it per eventType, and instead of a Bloom Filter, we’re creating a Count-Min Sketch. Apart from that, it’s a word-by-word copy. I just changed the variables, to be very honest, and it works. So, for each microbatch DataFrame, we get the list of unique eventTypes in that microbatch. Then we do a split by getting a sub-DataFrame per eventType, and we create a Count-Min Sketch on the sellerId column using Spark’s own DataFrame stat functions. Once we have that, we update the Count-Min Sketch in Redis. Again, this is exactly the same: we check Redis to see if a Count-Min Sketch already exists. If it does, we fetch it, merge it in place and write it back. If it doesn’t, it’s the simple zero case, so we just write it into Redis fresh. I already ran this for the data that got generated.

Now let’s look at the actual data, with respect to the group by conditions. I group by sellerId and eventType. On top of the whole data, we can see: seller 1169, purchase, 23, blah, blah, blah. So we have an actual frequency table. This took two minutes and 30 seconds. A key thing to note is that eventDF is already cached right now, so all of the subsequent queries happen pretty fast. Now, using the helper function for the Count-Min Sketch, we’re gonna fetch the addToCart Count-Min Sketch. Once we have that, we’re gonna look at seller 675 and addToCart. Let’s see what the answer is: addToCart is 21, but then here 675 has 27, so that’s kind of odd. It’s fine; it’s still in the same ballpark, so we’re okay with it. So we are able to fetch a single addToCart Count-Min Sketch and estimate the count for a particular seller that I know exists inside it. And again, if you look at the time, 1.3 seconds, this is guaranteed to be around constant, as long as your Redis doesn’t crap out.

Another thing that we can do, now that we have the Count-Min Sketch for addToCart: can we see how many total addToCart events there are in eventsDF across the sellers? The estimated number is 49,000, and we’ve run a query for that and can see that it’s still 49,000. So good; again, we saved a heck of a ton of time. We updated it in real time, and we are able to get estimates in real time without having to run these queries on top of the entire data all the time. So I think we were able to answer all the questions with a good amount of accuracy. Of course, for the data here I just used about a million entries. Ideally I wanted to use more, but hopefully that drives home the point that these data structures can be used in various combinations to answer common questions that you have.

Now, going back to the slides. In terms of usefulness, the main thing that I want to drive home is that with these common patterns you can optimize. If you know a particular query is gonna get asked multiple times, and it’s expensive, you could try adding multiple join conditions, or a (murmurs) join, or extra partitioning, and all that’s good. But if you are looking at real-time responses, it’s good to trade off the accuracy bit against the cost and latency bits using the probabilistic data structures available to us. There are a lot more interesting data structures, but these three I use quite often, so that’s why I went with them.

With respect to common examples where we could potentially use this: we saw the hypothetical Rainforest Inc. Another place that I see this being useful is ML training. A lot of the time the features that get built are very expensive. You have these huge batch jobs that are running, crunching the numbers and generating this feature matrix. Instead, if you identify certain aspects of it, you can just plug in probabilistic data structures. Another one is page personalization. In the example that we used, we can have a custom background based on, say, a seller threshold: if a seller sold more than five items a day, make it green. Obviously I shouldn’t be a product designer or a UX person, but you get the point. Another cool thing, a bit more unconventional: if you’re looking at bad or leaked password lists and all that stuff, you can build a Bloom Filter out of it and ship it to the client. Then you don’t need to make unnecessary over-the-wire server calls on every keystroke to check whether this is a bad password or not. You can ship that logic into the client itself. It’s 1.5 KB, which is way less than the background image that we have right now. So that’s all that I have. If you guys have more questions, feel free to reach out to me at my email.

I am going to upload the notebook that we have here so that you guys can try it out on your own, and hopefully it’s useful. The examples that we went through are not our own use cases, but in the Adobe Experience Platform, on the Unified Profile Team, we are making it a reality to use these sketches at scale in a very generic way. One of the profile features is powered entirely by the sketches; it’s basically a sketch on top of marketing data. So look out for a blog on that, or read the blogs on that.

About Yeshwanth Vijayakumar

Adobe, Inc.

I am a Sr Engineering Manager/Architect on the Unified Profile Team in the Adobe Experience Platform; it’s a PB scale store with a strong focus on millisecond latencies and Analytical abilities and easily one of Adobe’s most challenging SaaS projects in terms of scale. I am actively designing/implementing the Interactive segmentation capabilities which helps us segment over 2 million records per second using Apache Spark. I look for opportunities to build new features using interesting data Structures and Machine Learning approaches. In a previous life, I was a ML Engineer on the Yelp Ads team building models for Snippet Optimizations.