Low-Code Apache Spark

May 26, 2021 11:30 AM (PT)

Development on Apache Spark can have a steep learning curve. Low-code offers an option to enable 10x more users on Spark and make them 10x more productive from day 1. We’ll show how to quickly develop and test using Visual components and SQL expressions.
Spark frameworks can standardize development. We’ll show you how to create a framework with which your entire team will be able to create complex workflows with a few clicks.

In this session watch:
Raj Bains, Founder, CEO, Prophecy.io
Maciej Szpakowski, Co-founder, Prophecy.io


Transcript

Raj Bains: Hello everybody. Welcome to our talk today. We’ll be talking about Low-Code Apache Airflow, and we talked about Low-Code Apache Spark yesterday. Let us first start with our introductions. I’m Raj, I’m the founder of Prophecy. Before this I worked on compilers at Microsoft, I was one of the authors of CUDA at NVIDIA, and then I worked in databases, first as an engineer and then as the product manager of Apache Hive at Hortonworks. So, a lot of data engineering, and now Prophecy is focused on making data engineering very easy for everybody. Also with me is the co-founder, Maciej, who is an expert in Spark data engineering and Spark machine learning, and he’ll be doing the demo today. So as we look at Low-Code Apache Airflow, it’s all about making Airflow super easy to use. This is a new feature.
We are presenting it for the first time. So yesterday we talked about how data engineering is just too hard and too slow, how standardization in development is hard, and how enabling a lot more users and making them productive is super hard. Today, we’ll be focusing on metadata. For most of the people that we work with, metadata search and column level lineage is not something that they have today. So what are the use cases? Typically, if you are in data engineering, let’s say you have a governance need and you have social security data in one dataset. Perhaps you have 10,000 workflows that are moving it into many, many other datasets. So now you want to track this column: which other datasets did it end up in? Column level lineage helps you do that.
Or perhaps you are changing a particular column. You’re saying, “I’m removing a column from a dataset.” But you want to know whether other teams or other projects that are downstream are going to be affected, so you want to do impact analysis. Or perhaps you’re a data analyst and you’re looking at a value and you’re saying, “Hey, how was this value really computed?” You know what the definition is, but you want to know it was computed correctly before you make a decision on it. For all of these use cases, column level lineage is super critical. It’s something that most of the people we work with want and don’t have. So we’ll show you how we do it and give you a demo of this. The second thing we’ll dive into is scheduling.
For scheduling, Apache Airflow is good. It gives you a lot of basic parameters for scheduling, but scheduling on top of that is still very, very hard. As an example, we work with some Fortune 50 companies. Some of them are like, “Okay, I am moving 20,000 workflows to Databricks. How do you want me to schedule them? Should I write 20,000 schedules manually?” So you run into these problems: how do you write that many schedules? How do you deliver them and make sure each one has the latest version of the workflow? If something fails, how do you recover? How do you monitor them? Perhaps there is cost optimization, because if you spin up one cluster per workflow, soon you’ll be spending all your time spinning up and spinning down clusters. So we will give you a demo of low-code development of Apache Airflow and tell you about some new things that are coming right after.
So, as we mentioned before, if you’re a data engineer, you have use cases of impact analysis and understanding; or you’re a data analyst and you want to understand a value; if you’re a metadata architect, you want to have PII information and so on. These goals require a few features. The first is automated lineage computation. What that means is: from the code you have, can lineage be automatically computed and stored? If you have a separate metadata system that says, “Hey, you have to use these API calls to add lineage,” you’re never going to keep up. So, automated computation is super important. The second thing is it needs to be multi-level. So you might say, “I want to look across datasets and workflows across projects.”
Then once you are within a workflow, you might want to dive in, and the workflow can be very big, so you want to pinpoint exactly where the change was made. Finally, you want to be able to search. You want to be able to search by columns and say, “Where is this column used?” You might even have a user defined function and ask, “Where is it used across all my workflows and datasets?” So you can search by authors, by columns, even by use of functions and expressions. We will give this as a second demo. Now we’ll move on to low-code scheduling. Low-code scheduling is basically where you can use visual drag and drop to build Airflow workflows very, very quickly. So the first focus is that it has to give you rapid development. That means you can quickly drag and drop and enter only the essential values.
Everything else is automated. Number two, what about deployment? Let’s say you’ve deployed into a test environment and a production environment, and a workflow changes. Now you have to rebuild the jobs and redeploy the jobs. So there has to be automated deployment, where redeployments can happen without any manual intervention. Finally, once you’re able to develop and deploy, you have to monitor. When you monitor, you want to be able to browse all your deployments and all your runs and see what’s working, what’s not, what is delayed, and perhaps which parts are taking the most resources. All right. Now, just before the scheduler demo, I want to give a quick summary of how Prophecy works with your infrastructure.
You have your Git, your shell, your Spark, your Airflow. So you have your infrastructure, which you are using currently. Prophecy adds a data engineering system on top. Of course, you’re getting the low-code Spark development, but the code, the tests, the lineage, everything goes on Git. Now, we talked about this yesterday and showed you a demo. Today, we are focusing on the Visual Airflow Editor. So you have a low-code Airflow editor where you can similarly use drag and drop to create Airflow schedules. The code that you’re creating ends up on Git in a completely open source format. So it’s a hundred percent open source code. It’s in Python, exactly as you would handwrite it. The other thing we provide is that you can just connect to an Airflow cluster and interactively run and see the Airflow schedule running live.
You can see the logs as it moves from step to step, something you cannot do with Airflow without Prophecy. So we’ll show you a demo of that. Finally, we’ll show you a demo of metadata search and lineage: how you can search across all the columns and datasets, and also lineage, how you can track values across your entire data engineering system. Now, one thing that we also showed you yesterday is Gem Builder, which means that for Spark workflows you can use the standard library, which we provide as the built-in gems, but you can also build your own. This is available for Spark today, and will be available on Airflow as well over the next two to three months, toward the end of summer. Of course, for Spark we support both Scala and Python. For Airflow, it’ll just be Python.
All right. Now, if you don’t want to worry about any of these details and you just want to use a simple low-code system, you can do that. That makes you more productive. With that, let’s move on to the demo. In the demo, we’ll cover schedule development, low-code Airflow development, where you’ll use drag and drop to build Airflow workflows, and then see how you can run them interactively and deploy them to Airflow for production. We will also demo search and column level lineage at the end.

Maciej Szpakowski: Hi everyone. Thanks for coming. Yesterday, we showed you how to build a complete end-to-end data engineering pipeline directly in Prophecy in under five minutes. Today what we’re going to do is take that pipeline and schedule it using Prophecy’s low-code Airflow scheduler.
So here we are again. This is the metadata screen where we have our project that we created yesterday, with the workflow. We can look at what transformations the workflow has, of course. That’s what it looks like, but let’s go ahead and create our schedule right now. So we can just create a new customers schedule and open up the IDE. There we go. We start, of course, with a blank canvas on which we can now start drawing our schedule.
So we have all the gems right at the top, which essentially represent the Airflow operators, and we can use them to build the schedule. What we would like to do is take some files from an S3 directory whenever they show up and run a Spark workflow on them. Once our Spark workflow finishes, let’s clean up the S3 directories and send some email notifications. So I have this S3 directory ready. It contains the source data directory, which is empty, and this target data directory, empty as well. So without further ado, let’s get to creating our schedule. We can start by dragging and dropping our sensor component. Let’s open it up and make sure it’s connected to the right path.
Now the sensor component is essentially just going to be polling our S3 every 10 seconds and making sure that whenever files show up, the next operator runs. Okay. Awesome. Now, after that sensor runs, let’s drag and drop our workflow gem that’s going to trigger our workflow, the peak hour customer amounts workflow. For now, we’re going to be executing this on an EMR cluster. Of course, Databricks and other Spark environments are supported, and we can additionally provide some workflow configurations. Now, after that workflow runs, let’s go ahead and run a simple script that will clean up our S3 directory. So I have here a small snippet that will just iterate over all the files within the source data and delete them one by one.
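[Editor’s note: a cleanup step like the one described could be as simple as the following sketch. This is an assumption-based illustration using boto3, not the actual snippet from the demo; the bucket and prefix names are placeholders.]

# Minimal sketch of an S3 cleanup step, assuming boto3 and placeholder
# bucket/prefix names. Not the snippet shown in the demo.
import boto3

def clean_source_data(bucket="my-demo-bucket", prefix="source_data/"):
    s3 = boto3.resource("s3")
    # Iterate over every object under the source data prefix and delete it.
    for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix):
        obj.delete()

if __name__ == "__main__":
    clean_source_data()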
After that script has executed, I would of course like to receive a success notification, so let’s just drag and drop the email gem and fill it in. Cool. Then, alternatively, if our workflow doesn’t succeed, we would like to receive a failure notification. So we can just set up our gem to be executed only if the previous task fails, and let’s write our failure email. Okay. So just like that, we’ve built a complete schedule in under two minutes. What we can do right now is go ahead and check whether the schedule actually does what we intended it to do. To do that, Prophecy provides debug functionality for Airflow schedules. Essentially, you have this small play button at the bottom corner. We can click that, and it will generate the code for the DAG, put it on Airflow, and schedule it just one time.
While this is happening, we’ll see logs showing up in real time for every single step. So we can see that right now our jar is being built; after that, the DAG file will be created, transferred to Airflow, and our components will actually start executing. The alternative, of course, is that after we have finished with our debug run and we know that our schedule actually works, we can put it in production. With one click, we can deploy the complete Airflow workflow, which runs all the CICD steps, places this Airflow code within our release branch, builds the jars for it, and of course makes sure that it runs within Airflow within the set timeframes, which we can configure in our configurations. So we can set all the start dates and the intervals directly here.
Now the Prophecy IDE, of course, is not just a visual editor. Behind the scenes, there is always high quality code that you can just take and edit. So we can click on this code button directly here and we see the generated DAG code. This is our main.py file that actually contains the definition of the DAG. It sets up all the operators, which are just simple functions. We can of course inspect all of those functions as well. So this is our check file, which just creates an S3 key sensor within Airflow. The email notifications are just email operators, and of course there are the Spark run and the data cleaning steps.
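[Editor’s note: to give a rough idea of what a hand-written Airflow DAG covering these same steps could look like, here is a minimal sketch. It is not the code Prophecy generates; the operator choices, module paths (assuming a recent Airflow 2.x with the Amazon provider), bucket name, email address, and schedule are all illustrative assumptions.]

# Illustrative Airflow DAG for the schedule built in the demo.
# Assumes Airflow 2.x with the Amazon provider installed; names,
# addresses, and the schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.utils.trigger_rule import TriggerRule


def run_customer_amounts_workflow():
    # Placeholder for submitting the Spark workflow to EMR or Databricks.
    ...


def clean_source_data():
    # Placeholder for the boto3 cleanup step sketched earlier.
    ...


with DAG(
    dag_id="customers_schedule",
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_files = S3KeySensor(
        task_id="wait_for_source_files",
        bucket_name="my-demo-bucket",      # placeholder bucket
        bucket_key="source_data/*",
        wildcard_match=True,
        poke_interval=10,                  # poke S3 every 10 seconds
    )

    run_workflow = PythonOperator(
        task_id="run_peak_hour_customer_amounts",
        python_callable=run_customer_amounts_workflow,
    )

    cleanup = PythonOperator(
        task_id="clean_source_data",
        python_callable=clean_source_data,
    )

    notify_success = EmailOperator(
        task_id="notify_success",
        to="me@example.com",
        subject="Workflow finished",
        html_content="The customer amounts workflow completed successfully.",
    )

    notify_failure = EmailOperator(
        task_id="notify_failure",
        to="me@example.com",
        subject="Workflow failed",
        html_content="The customer amounts workflow failed.",
        trigger_rule=TriggerRule.ONE_FAILED,  # fires only if an upstream task fails
    )

    wait_for_files >> run_workflow >> cleanup >> notify_success
    run_workflow >> notify_failure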
All the configurations and other scripts are placed directly within this directory as well. Now, going back to our visual scheduler, what we can see during this interactive mode is that our sensor has already started executing. So if I hover over this icon, I can see that it is running and I can see all the logs available to me right here. I can see that it’s already poking this S3 directory. So let’s go ahead and place some data there. We can open it up. I have a directory ready here, and I just drag and drop the CSV file into it.
That’s the CSV file that our schedule should now pick up. So let’s go back to the IDE and look at the logs to see what’s going to happen. Now we can see that just within a second, the sensor operator has finished successfully and we can see all the success logs. That component has now been ticked, meaning it has completed, and our workflow has started. So now the workflow is spinning up, and we can see that it has already been submitted to the EMR cluster. The job status should soon turn to pending. Now, while this is running, all of the code that we have developed within the Prophecy IDE is also visible on the GitHub repository that we set up yesterday. So this is my summit session repository. If we just refresh it, we can see that there is now a schedule directory that was created four minutes ago. We can open it up and see our main.py file directly here, along with all the other code dependencies, our graph files, and so on.
Now, of course, once you have many different schedules working in production, you would also like to monitor those schedules and see how they are doing. Doing that is very easy with Prophecy. We can just open up our metadata again and, directly from here, switch to the deployment manager. The deployment manager shows you all the schedules that you have created that are currently scheduled and running, and you can monitor how well they are doing. So here we have two other schedules that have been running for some time already.
We can see that some of them are doing well and some of them have failed, and we can see which workflows they’ve run and the historical runs for them. Directly from here, we can jump into the Airflow UI if we want to inspect any details, see the code, or check out the Airflow views. But of course, everything is contained within the Prophecy IDE itself. So if I want to inspect each run, I can just go to the run tab and see each run, when they started and finished, and also the logs, so I can quickly check what went right or what went wrong.
Now let’s go back to our IDE. Oh, wonderful. Everything here has already finished. We can see the logs for our workflow, which has successfully completed; after that, our script has completed, and one success notification should have been sent. The failure one did not complete because it was skipped, as we can see here. So let’s open up my Gmail account, and we can see that just a second ago I got an email with our “workflow has finished” notification. We can also see how the S3 directory has changed: our source data has disappeared, and our target data directory contains the actual results of our Spark job. Wonderful. So our schedule has completed successfully, and in just under five minutes we’ve built this production ready schedule that looks for those S3 files, runs our workflow, and sends us appropriate notifications if something goes wrong.
Now, taking that a step further, very often your organization will have many different workflows and many different schedules running at the same time. They perform a tremendous amount of transformations and operate on hundreds or even thousands of columns. What do you do then? How can you inspect your transformations? How can you make sure they’re all correct? Well, for that purpose, Prophecy provides its lineage functionality, which I’m very excited to show you as well. So we have this project set up here, Spark Summit Demo, which has two workflows and a few datasets that those workflows are processing.
Now, normally, to understand what’s going on here, you would need to dive into the code, see what the Spark read operations are, what the Spark write operations are, and figure out from the code what’s actually happening. With Prophecy, however, to inspect what this complete project is doing, you can just click on this small lineage icon directly within our datasets and jump into the lineage view, which immediately shows you how this dataset is being created, how it is used, and which workflows modify it. Of course, I can just keep jumping between datasets, just like that. I can see how they’re being created and what they use. All of that is tracked within the history on the left hand side, so I can just go back and forth.
I can inspect, of course, the columns of each of my datasets. You can see them right here. If I would like, I can select a particular column and see how it’s being transformed. So if I, for instance, select the last name column within my customers dataset, I see all the transformations right here, in this case just downstream transformations. I can see this column is being changed because it’s highlighted in red within this Customer Amounts workflow, and it is being written to my report. I can actually deep dive into this workflow and see how it is being transformed within it. So I can open up any of the components and inspect the code right here. The code is also automatically highlighted for me so that I know exactly which pieces of code are performing the transformation.
Now let’s step back a little bit and think about two different examples where lineage becomes tremendously useful. One example is when we have PII data within our datasets. A very common example: there is a tremendous amount of customer data that you would not like to reveal or store in particular data targets. Right now, within my customers dataset, I have this Customer SSN column. To understand how it’s being used and where it’s written, it’s as simple as clicking on that column, and I can directly see that it’s being written into my Customer SSN dataset. It is not, however, being written into my report. That’s great. So, downstream users won’t be able to see it, and it is not used anywhere else. I can also inspect the actual workflow and see how it’s being transformed, if at all, and I can see that there is no transformation applied to my SSN column. It’s just a simple pass through.
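[Editor’s note: as a rough illustration of what such a pass-through might look like in the underlying Spark code, here is a minimal PySpark sketch. The paths and column names are assumptions based on the demo, not Prophecy’s generated workflow code.]

# Illustrative PySpark sketch of the SSN pass-through described above.
# Paths and column names are placeholders, not Prophecy's generated code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer_amounts").getOrCreate()

customers = spark.read.parquet("s3://my-demo-bucket/customers/")

# The SSN column is carried through unchanged into one protected dataset...
customers.select("customer_id", "customer_ssn") \
    .write.mode("overwrite").parquet("s3://my-demo-bucket/customer_ssn/")

# ...and dropped before anything is written to the downstream report.
customers.drop("customer_ssn") \
    .write.mode("overwrite").parquet("s3://my-demo-bucket/report/")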
Cool. That will make the life of the people who are checking compliance and ensuring the workflows work well much, much easier. The other example is a much simpler debugging example. As a developer, I have many columns that I need to make sure are correctly transformed. For instance, if I have an amount column and I would like to understand how it’s being created, it’s as simple as going to any dataset that I think might contain it and choosing that column, and I can see how it’s being modified. So I can see that upstream, the amount column is just computed as a sum of amounts. There is actually another sum going on downstream, but I can just click into the workflow and inspect that right here, and I can see that there is an aggregate sum of amount. That allows me to debug complex data pipelines much more easily and quickly.
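[Editor’s note: again purely for illustration, the kind of aggregation the lineage view would highlight could look roughly like this in PySpark; the dataset paths and grouping column are assumptions.]

# Illustrative PySpark sketch of the amount aggregation described above.
# Dataset paths and the grouping column are placeholder assumptions.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer_amounts").getOrCreate()

transactions = spark.read.parquet("s3://my-demo-bucket/transactions/")

# Upstream: amount is computed as a sum of per-transaction amounts per customer.
per_customer = transactions.groupBy("customer_id").agg(
    F.sum("amount").alias("amount")
)

# Downstream: another sum rolls the per-customer amounts up for the report.
report = per_customer.agg(F.sum("amount").alias("total_amount"))

report.write.mode("overwrite").parquet("s3://my-demo-bucket/report/")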
Okay. So with that, we’ve shown how easy it is to build complete schedules with the low-code Airflow scheduler within Prophecy, and how easy it is to understand your complex data transformations with lineage. Back to you, Raj.

Raj Bains: Thank you so much for the demo, Maciej. All right. So we saw how easy it is to do entire low-code data engineering with Prophecy. You saw low-code development of Spark workflows, and now you’ve seen low-code development of Airflow workflows, and also the metadata. You can see that all of them work well together. With that, we want to talk about one more thing, which is a roadmap item. Tentatively, we’re calling it Scheduler Gen. Like we said, we work with Fortune 50 companies, and the problem that we run into is this: they have 10,000, 20,000 workflows. They come to us and say, “Hey, do you want me to write all these schedules manually? You already know what datasets each of the workflows is reading, what each one of them is writing, and so on.
Why do I need to tell this information again? What about cost management, et cetera?” So what we are working on is this: since we have lineage, we can extract the plan from the workflows. The plan that you see in the middle has a dataset, which is read by a workflow; that workflow writes a dataset, the next workflow picks it up, and so on. So you can see all your workflows in one place. Now, once you do this, you cannot spin up one new cluster per workflow; you’re going to spend a lot of money just spinning up and spinning down clusters. So what we can do is be smart and segment them. Here you can see the purple ones and the green ones; we’ve created three different segments.
In one segment, all the workflows are going to run on a single cluster. When creating these segments, we are going to consider the size of the cluster given the amount of time taken, and make sure that we meet any [inaudible] that we need to meet. So once you’ve created these segments, we can generate the Airflow schedules from them: run segment one, then segment two. Of course, if something in segment one or segment two fails, you have to be able to restart from the middle, so there’s complexity there. You also need to have all the different workflows packaged into a single binary to be able to run on a single job cluster. So this is something that we are actively working on.
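[Editor’s note: to make the segmentation idea a bit more concrete, here is a rough sketch of how workflows could be ordered from dataset read/write metadata and packed into segments that each share one job cluster. This is only an illustration of the approach under simplified assumptions (it ignores cluster sizing, runtimes, and SLAs), not Prophecy’s Scheduler Gen implementation, and all the names are made up.]

# Rough sketch of planning workflow "segments" from dataset read/write metadata.
# Purely illustrative and simplified; not Prophecy's Scheduler Gen implementation.

# Hypothetical plan: each workflow with the datasets it reads and writes.
WORKFLOWS = {
    "wf_ingest": {"reads": ["raw_orders"], "writes": ["orders"]},
    "wf_enrich": {"reads": ["orders"], "writes": ["orders_enriched"]},
    "wf_report": {"reads": ["orders_enriched"], "writes": ["report"]},
    "wf_audit": {"reads": ["raw_orders"], "writes": ["audit_log"]},
}

def build_segments(workflows, max_per_segment=2):
    """Order workflows so producers run before consumers, then pack them into
    fixed-size segments that could each share a single job cluster."""
    producers = {}
    for name, io in workflows.items():
        for ds in io["writes"]:
            producers[ds] = name

    # A workflow depends on whichever workflow writes the datasets it reads.
    deps = {
        name: {producers[ds] for ds in io["reads"] if ds in producers}
        for name, io in workflows.items()
    }

    # Simple topological ordering (Kahn's algorithm).
    ordered = []
    ready = [n for n, d in deps.items() if not d]
    remaining = {n: set(d) for n, d in deps.items()}
    while ready:
        n = ready.pop()
        ordered.append(n)
        for m, d in remaining.items():
            if n in d:
                d.remove(n)
                if not d and m not in ordered and m not in ready:
                    ready.append(m)

    # Pack the ordered workflows into segments of a fixed size.
    return [ordered[i:i + max_per_segment]
            for i in range(0, len(ordered), max_per_segment)]

print(build_segments(WORKFLOWS))
# -> [['wf_audit', 'wf_ingest'], ['wf_enrich', 'wf_report']]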
We’re looking for design partners. We have one, and we’re looking for some more. So if you want to try this and make sure that it works for you, come talk to us. We’ll work with you over the next three to five months to build this with you and make sure it works for you. So with that, as a summary: Prophecy provides low-code data engineering, which makes it super easy. Each one of the pieces works very well with the others. It makes users very productive on Spark, on Delta, and also on Airflow. The code that you generate is standardized, and standardized on the standards you want, not just what we provide, and you can use Git, CI/CD, and all the modern software engineering practices. Finally, as I said, it’s a complete solution, which means that you are not stitching together your own solution from parts, but getting a full product. With that, we are really happy to take questions now.

Raj Bains

Raj Bains is the Founder, CEO of Prophecy.io - focused on Data Engineering on Spark. Previously, Raj was the product manager for Apache Hive at Hortonworks and knows ETL well. He has extensive experti...

Maciej Szpakowski

Maciej Szpakowski is the Co-Founder of Prophecy - Low code Data Engineering on Apache Spark & Airflow. He’s focused on building Prophecy’s unique Code = Visual IDE on Spark. Previously, he founded...