Modern data applications need to operate at speed and on data at scale. These applications often require a steady flow of large-scale data processed through Apache Spark™ pipelines. However, it can be a challenge to provide the other half of the equation: the sub-second query performance that user-facing applications need and that users expect.
One solution is to pair Apache Spark with Rockset, a high-speed serving tier that uses converged indexing to accelerate applications on the data processed by Apache Spark.
In this talk, we will discuss an investment analytics use case and how several investment management firms built user-facing applications for investment insights. They enrich publicly available data and alternative data in Apache Spark in both continuous and batch processes. Once indexed in Rockset, the data from Apache Spark can be joined with internal data sets to support investment decisions with sub-second queries, even for complex analytics, at 1000s of queries per second.
Joe Hellerstein: Hi, my name is Joe Hellerstein. I’m a computer science professor at UC Berkeley. I’m also the chief strategy officer and co-founder of a company called Trifacta. I’ll tell you about both of those, but today I’m mostly here to talk to you about rocket ships and washing machines and maybe how they apply to data and AI. So we’re here at the Data and AI Summit, and it’s really exciting to be here with such an enormous group of people. And we’re here, of course, because we’re excited about what’s coming in technology, but it’s worth stopping and asking in the midst of all the talks at this summit, what has data and AI technology done for us lately? What has it done for you lately? All right. And to put this into perspective, let’s use the lens of economics. So the 20th century, if you think back all the way to 1900, when the Wright brothers moved to Kitty Hawk, what an incredible century of progress in technology. We went from gliders, which were the first devices that the Wright brothers were working with, to jet engines and commercial airliners.
And we broke the sound barrier and yes, indeed, we built rocket ships and went out to outer space and even created space shuttles that could go out and back to earth. Really amazing, amazing technologies. Now, the closest I got to this technology personally was when the space shuttle was being retired, they flew it over my office at UC Berkeley. You can see it there in the picture, and that was pretty cool, but I can’t say that it had an enormous impact on my life directly. Now in that same 20th century, there was a quieter and really much more profound technological progression that you see in productivity technology. So the best example of this is domestic work. Here’s a picture of domestic workers, very typically women, washing clothes at the side of a river using rocks and stones. Over the course of the late 1800s, machines were introduced to make domestic work more productive.
And through the course of the 20th century, those machines became so good and so ubiquitous that we take them completely for granted today. And you can ask yourself, just stop for a moment and ask yourself, what would your life be like if you had to wash your clothes by hand at the side of the river? You’d get a lot less done and we would get a lot less done as a society. It’s basic economics. Productivity technology, over the course of the 20th century, the hours of work on domestic work went down from 60 to less than 20. And the number of women in the workforce, whether part-time, or full-time went from 10% of the workforce to 50% of the workforce. It’s a remarkable unprecedented introduction of labor outside the home for the first time, where domestic work did not dominate the lives of half the population.
Now, when we think about the technology we’re using for data and AI, it’s easy to get excited about the analogy to rocket ships. Think about engines that can work on petabytes and petabytes of data or machine learning algorithms that can really learn from what’s going on in the real world and predict the future. I know we’re all excited about this, but we might start by asking ourselves, where’s the high leverage productivity technology being developed and what does it look like? And a place to start looking at that is inward. Where does productivity technology help us as the people working with data?
So the context where a lot of the innovation is happening today with data and productivity is in the data engineering cloud. Keep in mind that every company has access to the same algorithms these days. They’re coming out of academia and they’re being released in open source toolkits and being adopted very quickly and improved as we go. And every company has access to the same computing power as well. Gone are the days where only the big players had the big computers. Now we all have access to as much compute as makes sense for our economic needs. The real difference between successful organizations and ones that are struggling is the data they have and the work that people do with that data. So there’s two key aspects to productivity technology as it applies to really anything. But let’s look at it specifically in the context of data and AI. The first is efficient use of scarce labor.
And the second is the increase in the labor pool. In both of these cases, the simplicity that is brought by productivity technology empowers everyone. So the acronym here being SEE, it’s not just for the laborers who began in the market, those scarce laborers, but really for the entire pool. So let’s keep track of that. And we’ll see how this plays out over the next few slides. This first aspect, efficient use of scarce labor. Back in 2012, DJ Patil, before he became the chief data scientist of the United States, wrote a book called Data Jujitsu and he pointed out that 80% of the work in any data project is in cleaning the data. Well, fast forward to 2018, at AWS re:Invent, the VP of machine learning for AWS says the hardest part of AI is the data wrangling. And somehow, as the years are going by, we’ve still got these super high powered people talking about washing our data on rocks and stones by the side of the river.
Big Data Borat probably said it best: in data science, 80% of time spent prepare data, 20% of time spent complain about need to prepare data. It’s time to do something about this, right? Now, looking at the labor pool, when we started out in 2012 at Trifacta, we were looking at the revolution in data science and really worried that we wouldn’t have enough data scientists to do the work. Well, things have changed. There’s a fascinating blog post by Vicki Boykis, whom you can see in the lower left corner. From her bio, she is a working data scientist out in the field, who trained from data analysis to data science, really an interesting voice of people working in the field today. And in this blog post, she has this great point about how data science is different now.
In fact, arguably, we have a glut of people who’ve been trained to do data science, in the sense that they can pull down a Jupyter notebook and run a machine learning model. But once she goes out and talks to them and says, well, what do you have to do every day? The answer, once again, is things like preparing data and getting it ready for modeling. And this is reflected in the labor pool today, where the number one growing job is not data scientist, it’s data engineer. This is the work that has to get done to be productive with data. So we might say that we want transformational engineering and set up a moonshot for data science, but you know what, that’s not on target. What we really want to do is we want to engineer a transformation of the way we work with data.
What we really need is a productivity revolution for data workers and it’s time that we all got on board with it. So modern data engineering really is this process of taking data from its source, connecting it, structuring it, cleaning it, enriching it, validating it, and then deploying it into practice into various different kinds of usage from cloud data warehouses to AI pipelines, to reporting and analytics. And this has to happen every day, all day. So what we’re doing is we’re building repeatable engineering pipelines. Interestingly though, this work is work that has to be done by everyone. It happens in a business context when there are business users who want to get going and just get cracking on data themselves. Of course, data engineers will build some of these pipelines, but data scientists who’ve been trained to mostly think about, say AI and analytics, also need to be able to do this work on their own.
And so what we need is productivity technology to empower everyone and all three of these personas and more, should be able to work with the same ideas with the same data and collaborate with each other without artificial boundaries between them based on tooling. So the data engineering cloud introduces what I think are three key vectors of simplification to enable everyone, SEE. So the first is the cloud itself and the idea that there aren’t barriers to get going, that relate to ops teams. You don’t have to have ops in the loop, if you want to get started in the cloud, you just sign in and get moving. And that’s really powerful, particularly for tools that data workers can work with that are cloud hosted. The second is the unbounded, but also pay as you go scalability of the cloud. So if you’re somebody who has a small budget and wants to do a little bit of work, you can get a bite sized piece of the cloud.
But if you’re a team that wants to get an enormous amount of work done very quickly, you can parallelize that work to an enormous number of machines and just get going really fast. So that ability to scale up and scale down is a signature of the cloud. That’s really something quite new and this combination of the ability to do simple things, as well as the ability to do incredible things is what makes the cloud a place where entire teams can work together. I like to say, there’s no walls in the cloud. We don’t end up with artificial boundaries between teams based on technologies and the access to those technologies. Now, the second vector of simplification that I think is radical in modern tools for the data engineering cloud is bringing AI and machine learning as assistance in the work that data people do. So this is human in the loop data work.
And so an example of this are guide/decide interfaces where the user is guided by AI assistance to some options they can choose from, but eventually decides what’s appropriate for the data and the use case at hand. Another example of this is to use the AI, not for sort of large whole scale problems that are hard to explain what the outcomes would be, but instead for individual bite-sized problems, where a user can go into the tool, have AI assistance and come out and evaluate what the AI just did in a small task and make sure that it’s appropriate because at the end of the day, humans are often on the hook for the work that they do with data. They need to be able to be interacting with the AI recommendations in these kind of bite-sized explainable pieces and where this all comes together is a field called Program Synthesis.
It’s a branch in computer science that’s really taken flight in the last decade where software generates software for you. All right? And so the idea is that you don’t necessarily have to write the code, but what comes out the other side of your process is code and it’s code that can be represented in a high level language, like even a natural language for non-technical users, but can literally be compiled down to Python or SQL or Spark for users who want it in that form. And then they can, of course, hack it, edit it, modify it in its native form. The third vector of simplification that I can’t stress enough is visualization and visual interaction. And I chose the acronym SEE, to just emphasize this. When we work with data, we are working with data. We are not working necessarily with code and so at all times, we should have our eyeballs on the data and often we want to directly manipulate that data.
At every step we should be seeing what is happening to the data when we take that step. So we want continuous feedback and interaction with our data. And then again, we want all those visual interactions translated down to code through software synthesis. So what happens is a visual experience for no code that meets your code, that you might want to write by hand in a single framework. So this combination of cloud, AI machine learning assist and visual interaction are the three vectors or the three signatures of the modern data engineering cloud.
So to make this concrete, I want to take you through the technologies that we developed in our research at Berkeley and Stanford, and then brought to market via Trifacta. I’m going to have my co-founder from Trifacta, Sean Kandel, do a little cameo, walk you through a demo of the product in action and some of its key features. And then I’m going to show you some technology that we’ll be releasing later this year, still in the lab, that shows you how you can access the easy to use features of Trifacta from a coding environment, and then bring things back from Trifacta into that coding environment again.
Sean Kandel: Today we’ll be preparing data for a credit risk analysis. We’ll be using two datasets: information about the individuals we’ll be issuing loans to, as well as information about the loans themselves. To get started, we’ve loaded a Databricks table into the Trifacta transformer view. Here, we can see the raw data itself as well as a visual profile at the top of each column. Trifacta automatically detects data types and uses that information to provide an appropriate visual summary. For instance, here we can see the distribution of values in each column, as well as a data quality summary. Here, we see that a number of values are missing. As we scroll left to right, we see a quick snapshot of our data, as well as any potential issues that we might need to address. In the phone number column we can see that there’s a number of invalid phone numbers. Selecting the column gives us more details here on the right.
For instance, we can see that Trifacta’s pattern profiler has actually identified two different types of formats inside of our phone numbers. So let’s get started with our first transformation by standardizing this column. Now, instead of writing code to clean up this data, we’re instead going to use Trifacta’s transformation by example. We can simply provide examples of what we want the output data to look like.
As soon as we give an example, Trifacta automatically suggests a transformation to apply to the data and shows what the output will look like. Here, we can see most of the data looks good, except for a few examples, which we can go in and update. When we provide another example, the underlying transformation suggestions update, and all of our data now looks good. Let’s name this new column and add this transformation step to our recipe.
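For readers who want to see what a phone-number standardization step amounts to in code, here is a minimal pandas sketch. It is not the code Trifacta generates; the column name `phone`, the sample values, and the `NNN-NNN-NNNN` target format are all assumptions made for illustration, since the transcript does not show the actual data.

```python
import re
from typing import Optional

import pandas as pd


def standardize_phone(raw: object) -> Optional[str]:
    """Return a phone number as NNN-NNN-NNNN, or None if it lacks 10 digits."""
    # Keep only the digits, whatever punctuation the source format used.
    digits = re.sub(r"\D", "", str(raw))
    # Drop a leading US country code if one is present (an assumption).
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None  # flag as invalid rather than guess
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"


# Hypothetical sample: two source formats plus one invalid value.
members = pd.DataFrame({"phone": ["(415) 555-0132", "415.555.0198", "555-01"]})
members["phone_standardized"] = members["phone"].map(standardize_phone)
print(members)
```

Normalizing to digits first means a new source format usually needs no new rule, which is roughly what giving a second example achieves in the demo.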
A recipe will include all of the transformations that we add to our data and we’ll talk more about what you can do with the recipes in a bit. But next let’s add some more transformation steps. Here, we can see that the credit agency is actually embedded in another string. Instead of writing regular expressions or other parsing logic to pull this data out, we can simply select the data we care about. As we interact with our data, Trifacta suggests potential transformations to apply. We can scroll through these suggestions and see some quick previews, as well as look at the grid for a visual preview. Let’s provide another example and Trifacta’s suggestions will update. Once the data looks good, we can add this step to our recipe.
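The hand-written alternative mentioned here, a regular expression that pulls the agency out of a longer string, might look like the sketch below. The column name `credit_report` and the `agency:<name>|score:<n>` layout are hypothetical; the transcript does not show what the embedded string actually looks like.

```python
import pandas as pd

# Hypothetical sample rows; the last one has no agency embedded in it.
reports = pd.DataFrame(
    {
        "credit_report": [
            "agency:Equifax|score:712",
            "agency:Experian|score:655",
            "score:700",
        ]
    }
)

# Capture the run of letters that follows "agency:".
# Rows without a match get a missing value instead of raising an error.
reports["credit_agency"] = reports["credit_report"].str.extract(
    r"agency:([A-Za-z]+)", expand=False
)
print(reports)
```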
Next, let’s standardize these titles. We’ll begin by formatting it so all the titles are in the same case. And then we’ll use Trifacta’s standardization tool to identify any potential misspellings or potential duplicates in the data. The standardization tool will automatically group values that look similar. Since these clusters look good, we’ll use our auto-standardize feature to automatically map each value in the cluster to the most common value. And again, we’ll add this step to our recipe. Now that we’ve added a bunch of transformation steps, let’s talk about the recipe itself. The recipe serves as an interactive history. You can go back through each of these steps in the recipe to see the data at that point in time. For instance, if we go all the way back to the beginning, you’ll see the raw data before we applied any transformations. We can also select any steps in the recipe to perform a number of operations. We can disable or enable steps, copy and paste them, move them around, or even combine them into reusable macros that we can share with others on our team.
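The standardization idea described above (normalize case, group values that look similar, map each group to its most common member) can be sketched in plain Python and pandas. This is not Trifacta’s clustering algorithm, which the talk does not describe; the sketch swaps in the standard library’s `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold and the sample titles are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

import pandas as pd


def cluster_and_standardize(values, threshold=0.85):
    """Map each value to the most frequent value that looks similar to it."""
    counts = Counter(values)
    canonical = []  # one representative per cluster, most common first
    mapping = {}
    # Visit values from most to least common, so each cluster is anchored
    # on its most frequent spelling.
    for value, _ in counts.most_common():
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, value, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(value)
            match = value
        mapping[value] = match
    return mapping


# Hypothetical job titles with inconsistent case and two misspellings.
titles = pd.Series(
    ["Engineer", "engineer", "ENGINEER", "Enginer", "Teacher", "teacher", "Techer"]
)
cased = titles.str.strip().str.title()           # step 1: same case
mapping = cluster_and_standardize(cased.tolist())  # step 2: cluster
standardized = cased.map(mapping)                # step 3: map to most common
print(standardized.tolist())
```

A greedy pass like this is easy to review by eye, which matters because the demo has a person confirm the clusters before applying them.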
Additionally, we can parameterize these steps of the recipe for more operational use cases, but for now, let’s add some data quality rules to ensure that the data we produce conforms to any expectations we might have. To add data quality rules, we can either add them manually or use Trifacta’s adaptive data quality, where we will automatically suggest rules based on the data itself. Let’s view the suggestions. Here, we see Trifacta provides a number of suggestions such as the member ID must be a valid integer as well as be unique across all records in our dataset. Let’s add a few rules.
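The two suggested rules, that the member ID is a valid integer and that it is unique across records, can be expressed as simple pandas checks. This is only an illustrative sketch: the column name `member_id` and the sample values are assumptions, and it does not reflect how Trifacta evaluates rules internally.

```python
import pandas as pd

# Hypothetical sample: one duplicate ID and one non-numeric ID.
members = pd.DataFrame({"member_id": ["1001", "1002", "1002", "abc"]})

# Rule 1: member_id must be a valid integer.
# Values that cannot be parsed as numbers become NaN and fail the rule.
as_number = pd.to_numeric(members["member_id"], errors="coerce")
is_valid_integer = as_number.notna() & (as_number % 1 == 0)

# Rule 2: member_id must be unique across all records.
# keep=False marks every member of a duplicate group, not just the repeats.
is_unique = ~members["member_id"].duplicated(keep=False)

# Share of rows passing each rule, similar in spirit to a data quality bar.
pass_rates = {
    "member_id is a valid integer": float(is_valid_integer.mean()),
    "member_id is unique": float(is_unique.mean()),
}
print(pass_rates)
```

Because the rules are ordinary boolean masks, the same checks can be re-run unchanged when new data arrives.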
And once we add these rules, we get a quick summary of the data quality at this moment in time. And we’ll also be able to apply these rules later, as new data comes in. Now that we’ve cleaned up this first dataset, let’s join it with the other information. We’ll select another Databricks table. In this case, the information about the loans themselves. Trifacta automatically suggests a potential join key to combine these datasets. We’ll then select the columns we care about. In this case, we’ll include all of them except for the redundant member ID column and we’ll add this join to our recipe.
Now that we’ve combined these two datasets, we’re ready to publish this out to another Databricks table. To do so, we’re going to run a job on Databricks leveraging Spark. By default, we’ve configured this to publish the data to a new table each time. Let’s take a look at some of the other settings. Here, we can see we have options to publish this data as a Delta table or as an external table, as well as some other options. But for now, let’s keep the default settings and run the job to publish the output. While this job is running, let’s look at some other features of Trifacta. The flow view in Trifacta provides a high level summary of the datasets that you’re working with, the transformations you’ve built, as well as the output that you’re producing. Here, we can see the input datasets that we joined together, the recipe we created earlier, as well as the output that we’re producing. We can also add notes and annotations to describe our work. The flow view is also very interactive, so you can add recipes or reorganize the flow.
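In pandas terms, the join step described here is a merge on the member ID followed by dropping the duplicate key column. The table contents, the column names, and the choice of an inner join are assumptions for illustration; the transcript does not say which join type the demo uses.

```python
import pandas as pd

# Hypothetical cleaned member table.
members = pd.DataFrame(
    {"member_id": [1, 2, 3], "title": ["Engineer", "Teacher", "Nurse"]}
)

# Hypothetical loans table whose key column has a different name.
loans = pd.DataFrame(
    {"loan_member_id": [1, 2, 2], "loan_amount": [5000, 12000, 3000]}
)

# Join on the member ID, then drop the redundant copy of the key.
joined = members.merge(
    loans, left_on="member_id", right_on="loan_member_id", how="inner"
).drop(columns="loan_member_id")
print(joined)
```

With an inner join, members without loans drop out of the result; a left join would keep them with missing loan values.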
We can add datasets to this flow view, if we want to augment our analysis with any other data. We could load additional data from Databricks or from any other number of connections, including applications, databases, and file systems. We can also share our flow with other team members, set up a schedule to run these flows on a recurring basis, or set up notifications or configure this to run with other third party services. For more advanced orchestration use cases, we can use Trifacta’s plans.
Plans can be used to orchestrate tasks within Trifacta, as well as tasks outside of Trifacta. For instance, here, we’ve built another flow to handle any failure cases with the flow that we just built and we’ve also added a call-out to a third party service to process any high-risk loans.
Now, let’s return back to the flow we built and check out the output that we’ve created. On the job results page, we can see a summary of the job that just ran, as well as more detail about the data that was produced. On the profile page, we see details about each column, including the distribution of values, as well as a similar data quality bar that we saw earlier and the rules page shows a summary of the data quality rules that we created earlier, ensuring that our data is high quality for any downstream analysis.
Joe Hellerstein: So hopefully Sean’s demo gave you a more tangible sense of how the data engineering cloud can bring together these three vectors of simplification: the ability to access things simply through the cloud, the use of AI/ML assist in these bite-sized pieces, and then rich visual interaction throughout the experience. What I want to show you next is technology that we announced at Wrangle Summit last month, that we’re bringing out of the lab later this year, and it’s technology to bring your code and no code together in a seamless way. The example we’re going to see is Trifacta being integrated into a Jupyter notebook environment with Python and Pandas. We’ve also announced similar technology to integrate Trifacta into an environment of SQL and dbt. So without further ado, let me show you that Jupyter and Python integration. So to begin, we import the Pandas and Trifacta Python libraries into our Jupyter notebook.
And then we’re ready to move back and forth between Python code and Trifacta’s AI-assisted visual interface. Next, let’s load up the member info data that we looked at previously, but this time, instead of getting it from Databricks, we’ll take it from a CSV file on our desktop. And here we can see the data frame that contains that data and we’ll have a careful look at that rightmost column as we go. Now, our next task is to pass that Python data frame to Trifacta. To do that, we’ll use Trifacta’s wrangle API, which is going to give us back a wrangle flow object and we’ll call it wf.
In the backend, Trifacta is creating a full featured wrangling flow that we’ll be able to look at later in the Trifacta user interface. To see the data in that Trifacta transformer UI, we just invoke wf.open, which pops open a new tab for us to work in. And when we’re done with that tab, we can return here and continue coding. So here in the Trifacta transformer interface, we can look at our data, scroll all the way to the right, to look at that field that we wanted to extract the suffix from. And we can use Trifacta’s visual interaction by giving one example, and then a second example. And by that time, the AI can recommend the right recipe step for us. Keep an eye on this recipe because we’re going to be translating it to Python.
Moving back to the notebook, there’s a bunch of facilities that we can access through APIs, to generate outputs, data profiles, and other information from our flow. To start with, let’s fetch all the profiling metadata from Trifacta in a programmatic form, as a single JSON object. This is useful if we wanted to write Python code that works with the profiling information. Now, as you can see, what we get back is a pretty large JSON object with a lot of detail in it. And in many cases, what we actually just want to see is a visual overview of the profile. So we can pop back into Trifacta with a simple API call and see Trifacta’s visual profile.
And as we scroll through the columns, we can look to the right, and we can see that new column that we generated and its full distribution. Returning to the notebook, we also might want to get the data out of Trifacta’s run, as a data frame in Python. So, there’s a simple API to do that, which is just wf.output. And here we’re back in our familiar Python data frame and again, this is the data at the end of the run. So if we look at our rightmost column, there it is, having reflected the transformations done in Trifacta.
Now, in some cases, we don’t want the data from the Trifacta flow. We actually want the logic of the flow converted into a form that we can integrate with the rest of our code. So in the Python ecosystem, what we’d like to get is Python code, using the Pandas library. Trifacta’s backend can auto-generate that Pandas code for us with a simple API call, get Pandas, and even populate our Jupyter notebook with the result. If we were in an SQL and dbt interface, then what we would see here is a dbt YAML file that was pointing to a set of generated SQL files.
Once we have this resulting Pandas code in Python, we can test it out on our original dataset, df. So here, we’re just strictly running Pandas code on a data frame. It’s just that that Pandas code was auto-generated from our visual interactions with AI assist and Trifacta’s data engineering cloud. And that Python that we’ve generated is suitable for checking in to Git or emailing to a friend or posting on Slack. It’s a totally independent piece of Python that you can put wherever you like to put code.
So hopefully that demo illustrated for you these three vectors of simplification and how they can empower everyone, SEE. But what I want to do is transition now from individuals and technologies to organizations. I like to talk about the data engineering cloud as being something that fosters T-shaped teams. What do I mean by that? Well, sometimes people talk about data scientists having to be T-shaped people: you should hire T-shaped data scientists.
What does that mean? Well, it means that they should have depth in the domain that you’re applying the data science, whether that’s business or the sciences themselves. But they should also have breadth, right? They should be able to use lots of different tools and be quick learners. And so on. Realistically, it’s very hard to build sustainable large organizations out of only T-shaped people. Instead, what we want to foster are T-shaped teams, where there’s different breadth and depth mixes across different constituencies and we organize them so that they can work together on a shared platform where the work that they do has no walls between it. Everybody can access the same tools and the same data through a simple cloud interface. And so for some takeaways, the first and the one that I want to stress the most is you should always see your data and I mean that both in terms of the acronym, but actually also in terms of the visual experience of working with your data. Data engineering is not software engineering.
And so you shouldn’t be using tools that are focused primarily on the software. You should be using tools that immerse you in the data and that keep you immersed in the data as you manipulate it, as you change it, as you generate the code that is your pipeline. That will ensure that your code is reliable, that it’s appropriate to the data and that it doesn’t have errors in it as you go. Secondarily, of course we should embrace algorithms and AI to speed us along the way and to assist us in calculations that we, as humans, are not good at. But let’s keep in mind that those algorithms and AI need to have a human in the loop. They need to be in their place. Here is a canonical example of AI in data preparation that’s not quite doing what we want. Now this, of course, isn’t the AI’s fault, in some sense.
We just need to give it a little more help and it’ll figure out what to do here. But the point is that we need a human in the loop, because at the end of the day, you are the one who’s responsible for the outputs of your pipelines, not the AI. The AI is a heuristic. Human in the loop is absolutely required to validate these things. So we should use AI and algorithms as bite-sized steps in our flows, overseen by people.
And finally code synthesis is a key piece of the technologies that will make this all possible to bring no code and your code together. So if you are working in a visual environment, make sure it’s an environment that can generate code for you so that you can integrate that code with other software artifacts in your data pipelines and similarly, so that folks who generate software artifacts can interact with the folks who were working in your no-code environment. Well, T as we know is for transformation from ETL and ELT, which is the lifeblood of data engineering. Trifacta is very proud to have brought the technologies from campus into the market that define the next generation of the data engineering cloud. We’ll see you there.
Venkat Venkataramani is CEO and co-founder of Rockset. He was previously an Engineering Director in the Facebook infrastructure team responsible for all online data services that stored and served Fac...