Configuration Driven Reporting On Large Dataset Using Apache Spark

May 27, 2021 04:25 PM (PT)


In the financial world, petabytes of transactional data need to be stored, processed, and distributed across global customers and partners in a secure, compliant and accurate way with high availability, resiliency and observability. At American Express, we need to generate hundreds of different kinds of reports and distribute them to thousands of partners on different schedules, based on billions of daily transactions. Our next-generation reporting framework is a highly configurable enterprise framework that caters to different reporting needs with zero development. This reusable framework entails dynamic scheduling of partner-specific reports, and transforming, aggregating and filtering the data into different DataFrames using built-in as well as user-defined Spark functions, leveraging Spark’s in-memory and parallel processing capabilities. It also encompasses applying business rules and converting the data into different formats by embedding template engines like FreeMarker and Mustache into the framework.

In this session watch:
Arvind Das, Senior Engineer, American Express
Zheng Gu, Developer, American Express

 

Transcript

Arvind Das: Hello. My name is Arvind. I’m a senior engineer at American Express. I’m here with my colleague, Zheng. We are here to present and give an insight into a framework we have built within American Express: a configuration-driven reporting framework which generates reports on a large dataset based on dynamic input provided to it, using Apache Spark as the core product.
Now, before we deep-dive into the topic, a quick word on AmEx. We are a financial services company which provides card and digital products ranging across financial services, business travel, and network and merchant services. We provide digital services and products for all of these capabilities here at AmEx.
Now, jumping into the agenda, we will start with a quick introduction to the reporting framework. We will go through why we need such a framework that works on a dynamic configuration basis. We’ll go through the components and the design, and we’ll understand what level of transformations we need in this kind of framework and how we can manage those transformations external to the actual framework.
The same applies for templates, because the reports at the end of the day work on top of a template. How can we manage those templates and use them externally to the reporting framework, so that we can come out with a reusable framework which again needs to work on top of big data?
This is a quick introduction to the framework. This framework works along with a dynamic scheduler which we have built, which again is external to the framework, and internally it does the transformation, aggregation and filtering of big data. The reason we have used Spark is to leverage the in-memory and parallel processing capabilities of Spark inside the engine.
Now, a quick glance at the statistics and why we felt a need for this kind of framework. Here in this division of AmEx, we have billions of records getting registered into the data lake each day, and we have requirements for various kinds of reports. There are partner-specific reports, so we have to give that partner flavor, with partner-level configuration, to the reports. We have generic reports again: a generic report collects a lot of partners’ information.
We generate thousands of reports daily. We have hundreds of monthly reports, so there is a bit of aggregation on a daily basis to generate those monthly reports. A report can be a feed, a summary or an extraction, so there are a lot of different kinds of needs for a report.
Now, if we want to cater to all these needs on a need-by-need basis, then we will have to write a lot of code, so we sat down and thought, “Can we create something like a framework, or can we create something like a reusable component?” Now, that led us to the needs, right, and the pattern that was emerging.
When we look at the needs of the reporting, there were two patterns emerging. One is that we have different report and feed needs. We have different kinds of stakeholders, and we have different partners who have different frequencies for their reports, different input datasets for the reports, different aggregation rules for the reports and, on top of all this, different templates.
We have PDF templates, we have normal text files as templates, and we have very customized templates as well in which the reports would go. These are the different criteria for the reports. At the same time, when we went through the requirements, what we saw is that there was a common pattern emerging. When I say a common pattern emerging: we had the input dataset in the same format. We always use Parquet as our input dataset, so when I look at my framework, I can go to a Parquet location and read that. The read part is a common pattern.
Now, there are reports in which, along with the input dataset, I might have to look up an external data store and enrich the data, so that’s a common pattern. Can I have a flag-based approach to say yes, for this report I can look into external storage, and can I give a lookup dataset name so that it becomes seamless?
Now, one thing which was constant in all our reports was a sequence of transformations. Can I pass these transformation rules dynamically into the framework? The framework’s only job would be to look into the transformation rules and apply them on top of my dataset.
These are some of the constant, common patterns which were emerging. Now, at the end, I have a template. Can I provide that template as a file or as a configuration external to the report, and then the report can just blindly apply the template on the dataset?
We had different configurations of parameters on one hand, and we had a common pattern emerging, so what we thought is, “Can we control these different parameters dynamically, external to the reporting framework, and run it on top of Apache Spark as a job? That would give us a common reporting framework.”
This was the pattern, this was the theme, this was the design thought which went into this architecture. Now, when we applied this and worked through the architecture, this is what we came up with technically. If you look at the middle box, the middle dotted box is our core reporting framework, which is a reusable component.
We have a Spark job, a Java application written to run on Apache Spark, which expects to read a specification file. The heart and soul of this framework is the specification file. This will guide the entire framework to do what it needs to do. Now, once it reads this, it knows the set of activities it will do.
It prepares the initial dataset, it looks up the data, it applies a set of transformation rules which are written in the reporting spec, and it applies the template. The template is a file which is provided to it, right? Now, this is the core framework. To provide that dynamism to this reporting framework, we had to keep this spec and the configuration external to the framework, so that when the job spins up, it can go and look into the reporting spec.
We leveraged HashiCorp Consul, which is our config management product. We have a solid pipeline for that, which we leveraged so that we could push this spec, external to our framework, to a config location. That became the config management system which really drives the configurations for my framework.
Now, how do I spin up the job in a scheduled manner? Because each job has to generate a report, we leverage a Kubernetes scheduler: we have an app running on our cloud platform which is backed by a Kubernetes scheduler. We have the partner information and partner-specific schedules maintained in the scheduler, which reads those schedules, processes them and spins up a Spark job.
Now, while it spins up, based on the report it is spinning up, it passes those dynamic reporting credentials to the Apache Spark job dynamically, so that the job knows which reporting spec it is looking at. It passes that, and that gives all the information to my job: my job knows which location in my config product holds my spec, and it will go and read that and then use it.
Now, finally, after doing all the stages (we will deep-dive into the stages in the later part of this presentation, where we will explain exactly what kind of rules we are looking at and how we apply those rules in the framework; we have a full-fledged deep dive into that), it applies all those steps and then it applies a template on top of the data.
Once the template is applied to the data, the report is generated and published to our [S3 bucket], which is basically the storage area for our reports for downstream transmission. One quick thing which is relevant to mention here, and again we’ll spend some time on it in the later part of the presentation, is that we have used a template engine which basically reads the template and applies it on top of a dataset, and we have used FreeMarker.
Again, we evaluated it a lot; since the language we had chosen was Java, it fit well with us, so we chose that product. These are basically the different components which brought the dynamism into the configuration of the entire framework. Now, I’ll hand it over to my colleague, Zheng, who will give you a deep-dive into each of these components, so we’ll spend some time on each of them to understand how we are doing what we are doing.
Before that, a quick glance at the configuration file. Basically, it’s a JSON file which we are maintaining, which has various pieces of information or specifications defined, starting from the report name. There is metadata like title, type and ID, which will be used for logging, auditing and tracking purposes to bring uniqueness to those reports. We have a schema defined: not all data in an input dataset is used in every report, so we have a schema which is defined to tell the subset of data which will be used.
We have lookup datasets, which basically tell which lookup dataset needs to be used with my input dataset. Then we have the transformation rules; basically, these are the rules which will transform or change the dataset based on the reporting needs. Then we have a reporting template, which is an external template we have provided, which the template engine uses to create the report. Then we have the schema: the definition of the schema.
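To make this concrete, here is a minimal, illustrative sketch of what such a reporting spec could look like as JSON. The field names and values are hypothetical examples based on the elements described above, not the actual American Express spec format.

```json
{
  "reportName": "daily-partner-settlement",
  "metadata": { "title": "Daily Settlement Summary", "type": "FEED", "id": "RPT-001" },
  "schema": ["transaction_id", "category_id", "transaction_type", "transaction_amount"],
  "lookupDatasets": [
    { "name": "merchant_reference", "source": "cassandra", "joinKey": "merchant_id" }
  ],
  "transformationRules": [
    { "step": 1, "type": "filter", "expression": "transaction_id <= 10" },
    { "step": 2, "type": "aggregate", "groupBy": ["category_id"], "agg": "sum(transaction_amount)" },
    { "step": 3, "type": "aggregate", "groupBy": ["transaction_type"], "agg": "sum(transaction_amount)" }
  ],
  "template": "templates/daily-partner-settlement.ftl"
}
```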
Now, throughout the course of the session, we will refer to our configuration file and tell you how each of its elements drives the different stages of the reporting. Without spending much time, I’ll hand it over to Zheng. Zheng, over to you for the deep-dive.

Zheng Gu: Okay. Let’s go to the deep-dive session. The first stage will be the apply schema stage. After we read the Parquet files from HDFS, we have our original DataFrame, but for the report we want to generate, we just need to use a subset of the columns in the original DataFrame. That’s the reason why we have this apply schema stage.
You might have questions about why the Parquet has so many columns when our reports only use part of them. The reason is that we need to reuse this Parquet to generate different kinds of reports or feeds, and each report or feed requires different columns. Also, the schema here is just a list of column names that we can get from the configuration file. After getting the list of column names, we can build a SQL query from it.
Here’s one example. There are four columns in the original DataFrame, and after applying the schema, you can see the result over there: the DataFrame with the two columns specified in the Spark SQL query is returned. In this way, we can reduce the size of the DataFrame at the beginning, which will also save computing time and resources later on.
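As a rough illustration of this stage, the Java sketch below reads the Parquet input and builds a Spark SQL query from the column list taken from the spec. The path and column names are hypothetical placeholders, not the actual production values.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ApplySchemaStage {
    public static Dataset<Row> apply(SparkSession spark, String inputPath, List<String> schemaColumns) {
        // Read the original Parquet dataset (the input location would come from the spec).
        Dataset<Row> original = spark.read().parquet(inputPath);
        original.createOrReplaceTempView("transactions");

        // Build a SQL query from the list of column names defined in the configuration file.
        String query = "SELECT " + String.join(", ", schemaColumns) + " FROM transactions";
        return spark.sql(query);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("apply-schema-demo").getOrCreate();
        // Hypothetical path and column names, standing in for values read from the reporting spec.
        Dataset<Row> reduced = apply(spark, "hdfs:///data/transactions/",
                Arrays.asList("transaction_id", "transaction_amount"));
        reduced.show();
        spark.stop();
    }
}
```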
This is our apply schema stage, and the next one will be the data lookup stage. After we apply the schema, we may still find some info which is needed by the report but not provided by the Parquet itself, so that’s the reason why we need this stage. We can also find how to look up and what to look up in the configuration file.
We’re talking about the configuration file all the time, so this is the most important thing in our framework. For example, here, we may need to look up from a [Cassandra table] and get a lookup DataFrame. That’s the sample lookup on the slide, and after we have the lookup DataFrame, we can join it with the original DataFrame based on the common key and have a new DataFrame created.
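A minimal sketch of this lookup stage might look like the following in Java, assuming the Spark Cassandra connector is on the classpath and using hypothetical keyspace, table and join-key names; in the real framework these would come from the configuration file.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataLookupStage {
    public static Dataset<Row> lookupAndJoin(SparkSession spark, Dataset<Row> reduced) {
        // Read the lookup DataFrame from Cassandra (keyspace/table names are illustrative only).
        Dataset<Row> lookup = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "reference_data")
                .option("table", "merchant_reference")
                .load();

        // Join the lookup DataFrame with the original DataFrame on the common key from the spec.
        return reduced.join(lookup, "merchant_id");
    }
}
```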
After this step, we’ll have all the data that is needed for report generation, but we still need to process the data and do some calculations before we can show it in the template. That’s the reason why we have this apply transformation rules stage: we need to aggregate the data at different levels.
Each time we do an aggregation grouped by several columns, a new DataFrame is created, and a report may need several DataFrames. So through the transformation rules stage, we create DataFrames, going from one to many. Other than aggregation, we may also need to filter [inaudible] apply some rules to the data.
We can also get the transformation rules from the configuration file, like step one, step two and step three in the example. From the example, you can see the original DataFrame: we have 11 columns and 11 rows here, and in step one, we want to get those with a transaction ID smaller than or equal to 10.
After executing step one, a new DataFrame is created, and you can see that in step two and step three, we want to aggregate the amount by different fields from step one’s DataFrame. For step two, for example, we group by category ID, and for step three, we group by transaction type, so they return us different values and different aggregations. This means we have different aggregations at different levels, and all of these DataFrames created by the steps may be used in our template.
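As a rough sketch, these three steps could look like the following in Spark’s Java API; the column names follow the example on the slide and are assumptions, not the exact production schema.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class TransformationRulesStage {
    public static void applySteps(Dataset<Row> original) {
        // Step 1: keep only the rows whose transaction ID is smaller than or equal to 10.
        Dataset<Row> step1 = original.filter(col("transaction_id").leq(10));

        // Step 2: aggregate the amount grouped by category ID.
        Dataset<Row> step2 = step1.groupBy("category_id")
                .agg(sum("transaction_amount").alias("total_amount"));

        // Step 3: aggregate the amount grouped by transaction type.
        Dataset<Row> step3 = step1.groupBy("transaction_type")
                .agg(sum("transaction_amount").alias("total_amount"));

        // Each step produces a new DataFrame; any of them may be referenced later by the template.
        step2.show();
        step3.show();
    }
}
```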
Also, we not only use the functions that are provided by Spark itself, but we also use some user-defined functions in our transformation rules. Here are the two main UDFs we’re using. The first one is a decimalization UDF. We basically use it to calculate the actual transaction amount: we take the value from the transaction amount column and the value from the decimalization factor column, and calculate the actual transaction amount from these two column values. So it takes two parameters: one, the amount, which should be an integer value, and the other, the decimalization factor.
Also, the second UDF we mainly use is signage. Basically, we just apply rules to change the transaction amount to positive or negative. All these UDFs we’re using are mainly to prepare the data before aggregation. Then the next step is to apply a template: we know that to generate a report, we need to apply the data we calculate using Spark to a template.
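Before moving on to templates, here is a hedged sketch of how two such UDFs might be registered and applied in Java. The UDF names, column names and exact rules (dividing by a power of ten, flipping the sign for debits) are illustrative assumptions, not the actual American Express logic.

```java
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class ReportingUdfs {
    public static Dataset<Row> prepareAmounts(SparkSession spark, Dataset<Row> df) {
        // Decimalization UDF: turn the integer amount into the actual amount using the factor column.
        // Assumes the amount column is a long and the factor column an int (hypothetical types).
        spark.udf().register("decimalize",
                (UDF2<Long, Integer, Double>) (amount, factor) -> amount / Math.pow(10, factor),
                DataTypes.DoubleType);

        // Signage UDF: make the amount positive or negative based on an indicator column.
        spark.udf().register("signage",
                (UDF2<Double, String, Double>) (amount, indicator) ->
                        "D".equals(indicator) ? -amount : amount,
                DataTypes.DoubleType);

        // Apply both UDFs to prepare the data before aggregation.
        return df.withColumn("actual_amount",
                callUDF("signage",
                        callUDF("decimalize", col("transaction_amount"), col("decimalization_factor")),
                        col("debit_credit_indicator")));
    }
}
```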
Finally, since we want to show the data in a template, that’s the reason why we need to apply the template. Here are two problems we had to resolve. The first one is, since we’re writing code in Java, we needed to decide which Java template engine we should use, and the second problem is how to transform the data so that it can be used in the template.
For the first question, we researched several template options, for example Velocity, Thymeleaf, StringTemplate and FreeMarker. Velocity actually meets our requirements to generate the report, but we know that it will soon be deprecated by Spring, so we didn’t choose it.
Thymeleaf actually has the same functionality as FreeMarker and Velocity, but it uses a different syntax that claims to be a little cleaner, it specifically supports certain template types, and it has wide popularity. The problem is that it does not release frequently; the most recent two releases have a two-year gap.
For StringTemplate, we noticed that the documentation is not robust and the framework is not stable, so we just didn’t want to use it. Finally, we decided to use FreeMarker as our Java template engine. It is a template engine for generic templates, it is supported by the Apache Software Foundation and widely used across Apache projects, and it has frequent new releases, the latest one being in February 2021. It also has good documentation and community, so if you run into a problem and search online, you can easily find the answer. Let me introduce FreeMarker a little bit.
FreeMarker is a Java template engine that generates text-based reports from a template and data. Templates are written in the FreeMarker Template Language, which is actually a simple language we can use; we just need to prepare the data, and then FreeMarker displays the prepared data using the template. In the template, you focus on how to present the data, and outside the template, you focus on how to prepare the data.
In this case, since we’re using Spark to prepare the data, here comes the second problem: how to transform the data so that it can be shown in the FreeMarker template. Our solution is to transform the DataFrame to a Dataset of Java [inaudible] and then transform it to a list.
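As a rough sketch of this solution, the Java snippet below collects an aggregated DataFrame into a list of Java beans and renders it with FreeMarker. The bean class, column names and inline template are hypothetical; the real framework loads the template from the external file named in the spec.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

import freemarker.template.Configuration;
import freemarker.template.Template;

public class ApplyTemplateStage {
    // Hypothetical bean matching the aggregated DataFrame's columns (category_id, total_amount).
    public static class CategoryTotal implements java.io.Serializable {
        private String categoryId;
        private Double totalAmount;
        public String getCategoryId() { return categoryId; }
        public void setCategoryId(String categoryId) { this.categoryId = categoryId; }
        public Double getTotalAmount() { return totalAmount; }
        public void setTotalAmount(Double totalAmount) { this.totalAmount = totalAmount; }
    }

    public static String render(Dataset<Row> aggregated) throws Exception {
        // Convert DataFrame -> Dataset of Java beans -> List, so FreeMarker can traverse it.
        List<CategoryTotal> rows = aggregated
                .withColumnRenamed("category_id", "categoryId")
                .withColumnRenamed("total_amount", "totalAmount")
                .as(Encoders.bean(CategoryTotal.class))
                .collectAsList();

        // A minimal inline FreeMarker template; production templates come from external files.
        String ftl = "<#list rows as r>${r.categoryId},${r.totalAmount}\n</#list>";
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
        Template template = new Template("report", new StringReader(ftl), cfg);

        Map<String, Object> model = new HashMap<>();
        model.put("rows", rows);

        // Merge the prepared data with the template and return the rendered report text.
        StringWriter out = new StringWriter();
        template.process(model, out);
        return out.toString();
    }
}
```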
FreeMarker supports traversing the list and showing the data in it. Here’s one example. On the template side, you can see that we can traverse the list provided as [inaudible], and each member in the list is a Java object. Previously, in the DataFrame, it would have been a row, and each of the [inaudible] can be referred to by using the dot. After we send the data and the template to the FreeMarker template engine, the final output is shown there. It brings out all the necessary information and all the data from our original DataFrame that is referred to. This is our deep-dive into our reporting framework. Let me give it back to Arvind.

Arvind Das: Okay. Thanks, Zheng. Just to quickly summarize what we talked about here and what our purpose in showing it was: if you can provide those configurations external to the framework, the framework will be reusable. The way we manage it is through a config management system; we constantly look up each report in the config management system and let the config drive what the reporting framework does.
A couple of benefits we saw using it: because we made it a black box and shipped it as a framework, our build time is reduced substantially, because the framework stays intact. All we have to do is change the configuration file based on the reports getting generated.
Now, the other big benefit is for the actual teams, because there are a lot of different teams using a lot of different types of reports. Being the reporting framework team, we build the framework and ship it across, and the actual team using it need not know the nuances underneath.
We really don’t need a highly skilled technical team with knowledge of Apache Spark and all those things to build a simple report, and we were also able to process big data, ranging up to tens of millions of records in one report. We’re constantly striving to improve the performance. We do see certain bottlenecks in terms of the download and transmission aspects, but we are constantly striving to improve those things.
Those are a couple of challenges I will point out upfront: because there is a download piece involved, it comes with the challenge of downloading. But if you can do that, and if you have those pointers, have the data evaluated, and plan your resources and underlying resource availability upfront… For any Spark-based application, you will have to plan your resources way ahead, so if you plan ahead and have those worked out, this framework will give you the benefits and support which we want.
Thank you so much for listening to this presentation. Feel free to get back to us if you have any questions. We at AmEx are hiring for our needs here, so please feel free to log into the AmEx job portal. Yes, thank you and bye-bye from me and Zheng. Thank you.

Arvind Das

Arvind Das is a Senior Engineer at American Express. He is currently leading a highly motivated team leveraging Apache Spark to design the next generation of data pipelines & data processing for the Ame...

Zheng Gu

Zheng Gu is a Software Engineer with American Express. He’s passionate about building scalable systems with Spark and end-to-end pipelines to schedule and deploy jobs which can visualize the whole pr...