Healthcare Claim Reimbursement using Apache Spark

Download Slides

Optum Inc helps hospitals accurately calculate the claim reimbursement, detect underpayment from the Insurance company. Optum receives millions of claims per day which needs to be evaluated in less than 8 hours and the results need to be sent back to the hospitals for revenue recovery purposes. There is a very low tolerance for delay and error in the claim processing pipeline because it saves millions of dollars in revenue for the hospitals. Most of the US hospitals run at a profit margin of 2%, without accurate claim processing, many hospitals won’t be able to survive. In the past 3 years, we have been replacing an Oracle-based system by one written in Apache Spark, Scala, and Java.

In this presentation, we will discuss

  1. The conversion of claim ingestion process from PLSQL on Oracle to Apache Spark which improved the performance by 50% and reduced the cost significantly.
  2. The design and POC have been done to replace the ‘claim reimbursement’ module from PLSQL to Spark.
    1. How using Spark opens up our solution space to accommodate both batch and streaming which was not viable with Oracle. Oracle is good for only batch processing with tons of trouble.
    2. How Spark is helping us write code once and run in a variety of ways which saves us development cost without compromising performance and scalability.
    3. How Spark helps us write better code which can be easily unit tested, integration tested on our IDE without needing any special infrastructure.
    4. How using Spark and the public cloud helps us reduce the cost of operation.
    5. How it enables us to enhance the claim processing pipeline with the application of machine learning which was pretty much impossible before with Oracle.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hello everybody this is Salim Sayed. I work as a principal architect in Optum. Optum is a leader in healthcare services. Today we will be discussing about our journey to claim reimbursement rewrite using Spark.

Claim Reimbursement with Spark

Today we will discuss the brief business overview of the claim reimbursement system. Why we chose Spark to rewrite, our old system into the new system. The claim ETL rewrite using Spark. We’ll discuss that first, followed by the migration and then challenges. Then we will discuss the claim reimbursement rewrite system using Spark. Then, we will talk about the Delta-lake adoption. The benefits and performance. The gains that we have seen. Followed by expanding the horizon and tips for successful migration.

Here is a brief overview of claim reimbursement system. As you see on the top right corner at the zero level, there is a contract and there is a coverage. So the coverage is the health care coverage of a patient with the insurance company. That stipulates that how much is the patient responsibility, how much is the co-pay? No coincidence and what are the services which are covered or not?

Things like that you might know and the contract is the, contract document between the provider, which is the hospital and the payer, which is the insurance company. So the contract is behind the scenes that we never see, that talks about how much of a payment needs to be made, for a service done to the patient. So, there’re concept complexities in that contract.

So our software can use that contract to evaluate how much payment to be made for a given claim. So let us see a scenario here. On the left you see there is a patient and the patient walks into the hospital when he’s sick and offers the coverage document, which is the insurance card and then the patient gets admitted, gets treated and after all the care is provided, the patient is good now and he leaves the hospital. And then after that the hospital accumulates all the charges and services done to the patient. Prepares a claim for those services and sends that claim to the payer. The payer will evaluate this claim with the contract that is already negotiated and makes a reimbursement, and then that reimbursement in either the cheque, goes back to the hospital. There is a possibility that this reimbursement is wrong or underpaid and in that case, the hospital would be losing the revenue. So, there is a process that hospital can figure out that there is underpayment happened,. So they can make an appeal and as part of the appeal, the payer will evaluate that and then may adjust the payment. So essentially, this our software in Optum 360, does and and Captures under payment and helps the hospital meet the appeal. So you can say essentially it brings money, in the sense that there are no losses that are happening behind the scene, that you know that would go unnoticed, unless our software is placed in the hospital and which is working for them to recover the losses.

I just see the claim reimbursement system is a complex system, because of the complex reimbursement rules that are there in the US healthcare system.

Why Spark

Because there are so many payers,

like Medicare and Medicaid and many commercial payers and there is tons of negotiation goes behind the scene and tons of different rules are created to make the reimbursement. So, a system which is built on top of PL SQL and SQL and our DBMS kind of starts to fail, or kind of starts to show, scalability and performance issues, as the complexity of the rules increases or the volume increases or either both of them happens and that is why we chose Spark, to rewrite our system. So that So that we would get the performance and scalability out a Spark. It works as well, our experience it works for both high volume and low volume scenarios. It is applicable for, typically for any application, even though the three V’s of the Big Data not applicable. Those V’s are volume, velocity and variety. So, normally it is recommended that if you have those three V’s, then you need a bigger application, but if you do not have those, in my opinion, you’d still be okay adopting Spark. Spark is stable, is production grade, you can depend on it for writing enterprise data application. The API is easy and mature, very easy to adopt, and the tool set around it are also mature, so, it will help you from development to production support and maintenance. All phases of software can be developed. It can be cost effective, essentially because as comparing to IBM as you might, well, you might be paying a license fee. Here, there is no license fee, the open source file is completely free and and source of your software. There is a possibility of cost saving, if you can adopt dynamic scaling in a cloud environment, because Spark loves to do sources. So, if you run a long running cluster with all that much resource, then it’s going to continue, you know it is not going to be costly. So you will need to adopt dynamic scaling to adopt that, but there is a definitely possibility of reducing costs as compared to IBM’s.

The reasons that we chose Spark was, it is easy to adopt, because as you know, in other few concept to master and Spark does such a good job in handling them behind the scene, that you will not really be worried about using this distributed programming happening behind the… Distributed execution happening behind the scene. You just write plain simple code and Spark takes care of stuff, things like Lazy execution, distributed nature of execution will be happening behind the scenes for you, but you will not realize. So it’s pretty easy to adopt. The API is very fluent and it will it will ,it will, it will, it will be natural and easy for you to use it. Java and Python are a very common skill set. So, you can get somebody… It easy to get developers to start working on Spark. The whole development process can be ID based, because you know because Spark can run your test cases. Just like you’re learning any standards of those test cases, without you feeling like, there is a cluster behind the scene, which is executing your workflow. You don’t need any virtual machine to run your jobs in a development environment. Just like Hadoop, unlike Hadoop, in the olden days.

The code that you would write is going to be batch and streaming compliant. So you don’t have to write two sets of code for the same business logic. It is compatible with many systems like it has amazing set of connectors to connect to various data technologies, as well as there is other technologies, so compatibility is there. You would never feel left alone. There is an active community support, for example, you know, if you ask a question in Stack Overflow, you might get another response within a day or so. There’re tons of tutorials on data bricks, a blog and the documentation on the Spark’s official website is on. So that helps you speed up the process.

Let us look deeper into the ETL. The claim ETL rewrite process. So claim ETL is a system which kind of acquires the claims and then cleans them up and then does some business logic on top of it, and then prepares it for the reimbursement. So the first piece is, acquiring the claims the ETL.

Here is the system before and after. So, as you see on the left side, the before is a typical PL SQL system, where the claim comes as you know, as files. And then we have written a parser, which is the audits parser, which kind of parses the claims which are in EDA format. So, it parses them and stages them into some staging table and then we run PL SQL code to acquire data from the staging table. Clean them up, process them and then push them into final Oracle tables. And then from there onwards it goes to downstream processes, which is things like, assignment rule, the (indistinct) process the reimbursement process and things like that. Which are not shown in the diagram. And that system we rewrote in Spark, as you see on the after on the right hand side. You see the main difference is the (indistinct) block is replaced by as Spark and (indistinct) So in the new system, we acquired the claim files the same way as files and then we pass it to the same parser, the output is now acquired by the Spark as files, as department files and then we do all the business logic over there. And then we use market based data lake to run our to run our you know… to do history lookup and run all the business value out there. And then after it’s finished, the result goes back to the staging tables and then those staging tables, from then those staging tables, it goes back to the record table. So what I mean is Spark processes the claim files and uses that data lake to do a historical lookup and helps in processing. At the end of the processing, the result goes back to the data lake, as well as originally is pushed into the Oracle finals tables, through the staging tables. So data is pushed to the staging table and then from there we run some words sequel that which kind of merges the data from the staging table into the primary table. And then once the data lands in the primary tables, there are many downstream systems from here, which are dependent on the table, so they will pull it up from the Oracle table and handle it for the staff.

Let’s see the highlights of the claim ETL rewrite process. So we use Spark 2.4, the open source version Our data lake is parquet format, and we are working towards migrating it from parquet to Delta lake and we’ll learn more about the performance benefits of that toward the end of this presentation.

We use currently a Spark standalone cluster on premise. We’re planning to migrate to Azure public cloud. Sometime some moths down the line. And then we are also using tools like Zeppelin notebook and Spark Shell for debugging issues on the other Spark.

Let us discuss the gains and challenges that we have seen. As you see here, this enormous figure, the x axis is volume and the y axis is time. The Orange Box is for Spark law and then the gray box is for the Pl/SQL program. As you see, the Spark load is much stable. So the volume rose from 4 million to 20 million. Still does fine timing did not go up that much. Because it could scale, whereas the other PVSQL program You will see that as the volume grew, the time grew much higher, at the same time, if the volume will grow even more, then we have failure with load. We are probably low 30 So, as you see here from a medium to high volume scenario, Spark always beats in performance and scalability, the Pl/SQL system. What I’m not showing in this diagram, is a low volume scenario, let’s say half a million records kind of scenario, where the Pl/SQL load is actually faster than the Spark load.

Because they’re just you know, the Spark load, has some overhead of distributing the data and analyzing it and learning it.

So, in lower scenarios, Pl/SQL wins, but the difference in timing is like in minutes, so, it doesn’t really matter, that it is faster for low volume and extremely slow for high volume. So, in all cases, we felt like on the performance and scalability aspect, Spark will work over the traditional pl/SQL system.

Continuing on the gains, so you know, when as time passes by, right, the codebase and the technology that we have used kind of becomes old and loses its shine, So, so there may be time for and there’ll be technical debt mounting on the old codebase. So, as part of the rewrite, we kind of get rid of all the technical debt and the issues that we have in our code. So, that is the key. There is a possibility of saving cost, as I said before, by using dynamic scaling and then a side effect of this rewrite, is that now our data is placed in two places, like it is now in database as well as it is in data lake. So the data lake data is in all confidence is splittable and it’s available for you know, any processing which can use relevance, because it is splittable and composed. So, it is a great infrastructure for doing ad hoc analysis and machine learning kind of workload.

So that is one advantage that you can use for other things that you could not do with our DVR system.


Let’s look at the challenges. This is a typical challenge for any new system, right? If you write a new system, then there are no sub systems on the system. And you realize that, you know, the operation team and support team, they have built this tool set around it. Now you and they get used to the old system, so you use the larger system, is to the way you interact with the data set. So all of that needs to change. So you need to consider operationalizing, your codebase, which is much beyond the development of it. So you need to consider about all the side tools, all the processes and procedures, that needs to change. So, you need to be developing, your new tool sets to interact, telling people to use those tool sets.

Data lake is a little different from database, because data base has indexes, so it can really access data quickly, but the parque based data lake is not that quick when you access (indistinct) So it works extremely well for large volume, but if you’re debugging something, you want one record, then you might have to wait like 20 minutes, 30 minutes to just see the record comfortably. So that becomes challenging sometimes. And then people need to learn new skill to be able to use Zeppelin and Spark SQL, which is very much like SQL still logic.

Somewhat challenges are like you know, there are custom tools and scripts which are developed as I talked before. So that also that also needs to be refactored. Cost saving is possible, but only in a dynamic environment as a as a resource, right. So, if we have if you have a infrastructure which is not damaged, dynamically scalable with a billion seconds like monthly or yearly, then you would be incurring a lot of cost, because Spark needs a lot of resources and if you are maintaining those resources for a longer period of time, then you are painful. But if you are on the cloud, then there is a possibility in our minute wise billing or hourly billing. Then in that case you will not be feeling that comfortable enough (indistinct) to get the job done. So, so in that scenario there is going to be cost. So what I’m saying is it’s actually advantage that they use cost savings, but do not promise that cost saving until you interact. That is the challenge. The development process, like you may be having multiple teams in the development cycle. So, your one team they have experience by working on on the Spark project, other teams may not and you know, so but to adopt you need a wider skill set. So, you may also need to be in obtaining the development team, to be able to make contribution to this. So you need to take care of that as well. So the last point here, possible data consistency, is a very specific case here. What I mean to say here is that, here you saw the data is copied in two sources. Like into the data lake and the database and so, So, if the process depends on the in perfect sync, then there might be issues arising from that, because if a job fails in one system, but the push of a button does not fail, that could be a sync issue created. They could be out of sync and that may cause issues. So, if you are developing a technology along the line where data is duplicated, then you have to make sure that it is resilient, that failure in one system, is not affecting the other system. If you tie them down, to be perfectly synced, then there would be issues. There will be issues in all/some of the data.

So the first part was the claim ETL. Now once the ETL is done, the data goes to that, you know Those (indistinct) tables and then the reimbursement software kicks off and does the thing we must and this is the portion where we’ll be talking about the claim reimbursement rewrite using Spark. This is a very complicated system, as I was saying that contract and as contracts are defined using very complicated business rules. So because of the complexity, so we chose Spark, so that we can handle all the complexity, which could not be handled with typical SQL algorithms, because as the complexity (indistinct) of the sequels become too complicated and then the database, has problem in producing optimal plan and that causes a tons of problem with scalability and also no sterility. So, let’s see how we count on that.

So here’s a comparison of the whole M.U system. On the left hand side here you see the old system is receiving data from an entity called the Clearinghouse, which is the Optum’s clearinghouse for receiving claims from all the providers. So that those claims from Clearing House end up into a data base and then into a file system and database and ultimately comes to our system, where there is a set of PLSQL procedures and as well as some job or so is communism Magellan PL SQL, which gives the claim the investment plan and the reserve is pushed to a like a table, set of tables and then from there, it goes into a feed process and the downstream processes further. So this system is nearly done and it is nearly done in a broad way. Not just a PL SQL process was changed, the whole (indistinct) was changed, because we’re modernizing the whole data acquisition and processing of our (indistinct). So on the right, what we are doing here is, we are acquiring all the claims, through the CMTN house. But the Clearinghouse is little direct communication between the clearing house and our system. The communication is happening through Kafka, so we have defined Kafka in the two topics, so claims have been sent to a particular topic and then our application is listening to that topic, to receive the claims and this system is a streaming system. It receives the claims on Kafka topic. Processes them and then push the data back into another topic, which is also on Kafka. At the same time it also pushes the result back into a data lake, which is a data lake based data lake for Ad hoc analysis as well as it also pushes the data back into a data warehouse.

So let’s see in detail how this is.

Design Constraints

So, since this is a proprietary system, often which is used for claim we must not, we wanted to make sure that, the capability is exposed in many ways. Like for example, as a REST API and a streaming API, also batch compliant, it needed to be an inner source and shared library. We also wanted to be able to use a common domain driven design to tackle the complexity of development process.

So the cracks of this constraint is that, we wanted this core capability to be exposed and reused as many ways as possible. So we wanted as less dependency on any technology as much as possible. At the same time, we also wanted, scalability and performance, we would not want it to be constrained by (Inaudible) So here, I’m showing you like three versions of the integration.

So, in all the three versions that you see, the claim reimbursement jar. This is the the core capability or reimbursement and it is developed using Java and this can receive input in terms of planes or objects. So imagine the two inputs to it and we’ll see the poor in a little bit. So imagine this is a this is a shareable infrastructure. So this we can plug in to many different, delivery mechanisms to be able to achieve the claim reimbursement in various ways. So the top one is a streaming mechanism, where they showed you the data comes from the clearing house through Kafka and then we are using Spark to be able to stream this data in parallel and then convert that data in JSON format to domain objects. And then those domain objects are fed into the claim the reimbursement software jar. revisits into the Java API and then does the reimbursement… That does the reimbursement and the output goes back into, the Kafka que as well as to the Delta lake and into the data warehouse. Similarly, there’s a batch mechanism which is essentially almost saying as the streaming mechanism where the data is read from a different source, which is a data lake. Again, the data is converted by using data set API from the packet to a data object into a JSON object, domain object and then that is passed into the claim reimbursement software, APA. And then that does all the heavy lifting of reimbursing and then the output is being pushed into (indistinct) The third one I’m showing is a non Spark application. It is a typical a REST API. So we have requirements that the users or even other systems could be doing any live interaction. They could be standing it in live and expecting the reimbursement to be done quickly, returning back for that kind of delivery mechanism. The same core software, which is the Claim reimbursement jar. The same software is plugged in into as a REST API and and delivered… The securities are delivered over the website, or from one application to another application. So as you see in all the three different integration frameworks, the common is the Claim reimbursement API, which is the core software, whereas the data piece is handled by different pieces, for example, by Kafka or by Delta Lake and then the scaling is also a little bit different. So in the top two cases, Spark takes care of scaling and performance. Well, at the bottom, it will be different.

Sample Code

Let us see a sample code, which is streaming this code. On the top I’m showing that there are two things important over there, which is the claim reimbursement API and (indistinct) Essentially, you know, we’re reading, the claims from a Kafka queue and the claims are in JSON format. So they’re converting the claim, as demonstrated in line number 30. The claims are converted from a JSON format into a claim object using the data set API.

And as we know that for the reimbursement, we need to data sets that we need to claim and we also need the contract. So the contract is being read in line number 11 and nine from database directly. So the contracts are read and then broadcasted so that they are available in all the executors to be looked at. So once we get the claim in the contract, here, in line number 15 I’m showing you that as a map operation, we are simply calling that the Claim.Reimburser passing that in the contracts And as the managers, as you know, it’s all about collaboration there are committed to asking. And then that Claim.Reimbursement.calc gets the bulk of the business logic executed. So the responsibility of Spark here is to be able to, get the data in a parallel mechanism and scale the processing. The responsibility of the Claim.Reimburse mixed API is to get the business logic done. And the responsibility of the Kafka system is to be able to move data from one system to another and the three together, provide us the kind of scalability and complicity that we’re trying to achieve.

Implementation Highlights + Same claim reimbursement library is used across streaming, batch

Let us see the highlights of implementation. As I was talking, the core of the library is written, as a library in Java, so that we could expose it across many different technologies like Streaming, batch and REST API. So that business logic is written in Java. And we chose Java because it is compatible to both Java and Scala. So we could be writing Spark code on Scala and you know, just calling the library so that’s how I do. We specifically intentionally tried to avoid writing the business logic in Spark SQL.

Because once we use Spark SQL, then that software can only run Spark (indistinct) we could not run it at the REST API. So for that reason, we did not use Spark SQL to write the core logic, but we did use Spark SQL to be able to acquire data, translate the data, convert it into domain object, and then call API.

We heavily use data set API’s and if you’re using small data, if you’re using Java API from Scala, you can do that, but you might have to convert the objects from Scala to Java. You may even have to use converter API to be able to convert Scala collection API to Java collectionAPI and then you will be able to call the majorities. We use spot heavily for its scalability and in scalability and stability as well as, connectivity to many various data sources. So it is a central component of this architecture.

So we used data set API heavily, so that we could use the APA to do the operation.

We carefully designed the infrastructure in a way that all the components of the claim are placed together as an aggregate object. So that we do not have to do any joints to build a claim object. So claim object is pre built during the ETL process and which contains all the information of the claim just in one record or one object to avoid joints and then the contracts are read and broadcasted from database directly. So basically essentially the calculation process ultimately becomes a map of this, because you have the whole object right away available to you and you can just call the API and (indistinct) passes to objects without worrying about joint. That takes a long way into performance. So, you have to do a lot of planning to be able to design that object beforehand, but it is what…

Let us see the result between the old system the new system. So, this is a experiment, this is a performance run that we had on a full data lake. So, one of our data lakes contained 80 million claim records, which is like 100 gigabytes. So, we use two systems right, the old and new to see the (indistinct) So, the typical Claim Reimbursement system has like two steps. One is the reimbursement, like the you get the claim and you win the contract and you apply the business rules to calculate the reimbursement. So, that’s one thing. And then once the reimbursement is calculated, then step two is to push the data out into other sources, for example, in your data warehouse or even database.

(indistinct) data base so that it can be reported, it can be shown on the user interface. So there are two components to that and the reason I’m calling them outis because they have different scalability profile. For example, the first system, which is you know, running from the file system to file system, or running from a streaming source to another streaming source, that runs extremely fast. And that’s it is also extremely scalable. As you see 80 million claims were processed in just 86 minutes, and then the complexity is extremely high here, but it will just scale regularly well there and the throughput will be like, is like 1 million claims per minute. Whereas pushing data into a DBMS is not that fast. Actually here we were lucky that this operation is merely side operation. Because the nature of it, but if it had been a update operation or delete of data operation then it would have been much slower. So, even with the insert operation, it is almost, two times slower. So, what all this system gave us a throughput of almost 333,000 claims per minute of throughput and that was achieved with 20 virtual CPUs with 100 GB memory kind of hardware. If we compare that with our old system, then there is no exact one to one competitor, because we never would have run over systems on such a high volume. We could not say that “My database had 80 million records. Can you go ahead and calculate all the things.” We could not do that, but we have done 300,000 accounts per hour.

Even a million accounts per hour if we could calculate. Like we want (indistinct) 80 million at the same time, but if we can break it apart into chunks of in half a million, then we could calculate it inn the whole system. So that way if you compare the performance, the whole system is like 400,000 claims an hour and the new system is like 330,000 claims per minute So essentially, the new system is like 50 times faster than the old system.

And from a cost point of view, the new system is much, much better than old because, as I said, if you are on the cloud and we are going to be on cloud, but we’re not there yet. So, if you’re on the cloud then the typical hardware, which is like a D class software. D class, sorry, D class node. Excuse me having 20 CPUs, 140 GB, Premium Tier and Data Engineering kind of a package with Databricks which is almost $4 an hour. With that workload, we could be finishing 80 million claim calculations in 20 hours, plus we have to pay for the storage. But when in comparison, if we were using non DBMS, then there’s no direct competitor here, because, for example, Oracle on the cloud, do not provide dynamic allocation and dynamic scaling for non Exadata workloads. If you’re on plain simple, regular Oracle, then you are on a virtual machine, which cannnot be scaled up and down that easily.

So it cannot really, scale up and down the resources to save costs. That’s not possible there. In PostgreSQL, it is possible, but the scaling is not that graceful. Like scaling requires restarting your application and stuff like that and at the same time, we haven’t seen such a high volume go through. So there is no exact comparison here, but as you see, there is a possibility of cost saving. A tremendous amount of cost saving if you move all your complex workload into Spark and then use your IDBMS for reports and (indistinct) In that case you will not be spending high amount of resources and you will not be spending peak resources on database. So, you can scale down is there, save cost and spend some extra… Spend some cost on your Spark based workload, whenever you do.

Delta lake adoption

So that is the end of discussion for claim reimbursement direct.

Let’s move on to the next topic, which is Delta lake adoption. So, earlier as I was showing you we have currently (indistinct) data lake and we are working on replacing it by Delta lake based regalia. So, these are the performance in numbers and the notes that I have on a system.

So, we are using open source Data lake for the time being and not the Data bricks that manages Delta lake and this open source Delta lake is good enough for us. So, in the open source version you do not have Z-order. So, what we did is we partitioned the Delta lake by one key and we order the data by another key and we chose this partition key and the order key, consciously depending on the, most access key. So, if we are accessing our Deltalake most of the time by two keys, then those are the two keys we used for partitioning. So, by doing that what happens is, any look up, any kind of insert update on those keys are extremely fast. Whereas any other operation on any other keys is going to be the same perfomance. Same slow or fast as a (indistint) The’re a couple of notes here, if you are using the open source version outside of Delta lake environment, then the Z-order doesn’t work. Z_order is a great functionality where you can…

You can actually make your data lake perform for a couple of keys. Like, so without Z-order you can order your data lake by one key. So in that case, you know, access that one key is… Access by that one key is going to be fast, but with Z-order, you have the option of using more than one key. So, the difference is that you can actually have three four keys over there, but the Z-orders efficiency reduces, as the number of keys increases, so you will not really use too many keys there, but a couple is good.

The optimize command also works only in case of a Data bricks enrollment. So to get around that, you have to be using.. Rebuilding your Delta lake basically, the rebuilding it once in a while to achieve the equivalent of optimism. But we are planning to go into into the cloud pretty soon and then there we’re going to use Databricks so we would end up using optimism and Yoda, just that the timing.

And I was saying the keys for partitioning in order by you need to be carefully designing that because that impacts the performance a lot. So anything beyond those keys are still going to be in an equally slow…

Equally slow to access using Oculus develop, but access with these keys are going to be extremely fast.

So as you see here, all the queries on those two keys are extremely fast and here is a code sample, I’m showing you how to build your Delta lake on a (indistinct) You essentially read the data as part of the parque and then save that with the, Order Back clause and with a partition back clause as a data. So the region will be a data lake, which is going to be extremely fast. Let’s see how the performance looks like.

Performance Comparison

So here’s a performance comparison as you see in the first several rows, all the operations, insert, update, merge, join, all of them which are done on those keys are extremely fast. As captain you see a simple real operation was taking 18 minutes and it takes 20 seconds. It’s simple lead operation on one column, was taking three seconds, now it takes one seconds and then left up there, things like that, if you’re doing update merge, then that require the (indistinct) but now you can do those operations without revealing the data. So, that it’s extremely fast, one hour 10 minutes, compared to 370 seconds. – Joins us faster too. The things which are not fast is the last row. It does any operation which is… I mean which is not on those keys, then is going to be equal timeframe and you can always reduce this time by increasing the resources.

So and you may not be challenged by this in case most of the processing is not happening by those keys, but having a few keys which you used for, publishing it online.

Expanding the Horizon

So after the rewrites, we are now expanding our horizon. We are looking forward to using new technologies.

We are now not challenged by high volume operation If there is volume increase, then it becomes a matter of hardware scaling.. We’re not depending on the sequel optimization and at the mercy of our IDBMS exhibiting handling the complexity in order to handle Spark’s complexity to do well.

Without any actual performance issues. Integration with streaming workload is much bigger, much easier, as compared to in hardware adherence, applying machinery learning our database settings. Looking much much closer and then we are also looking forward to selling cost.

So you have some tips from the two magazines that we did. Which might come handy to you So, running spot on premise does not save cost. That is what I was saying before. So plan for cloud migration as a core component of your model. Then that way you can promise cost savings, which is very attractive to the business. Consider changing the production support procedures and other tool sets, which are not your direct part of your software, but which comes into play, as you deliver the software into production and other teams come into play possibly. Definitely, use Delta lake. Do not use the parquet based data lake at all. Because there is hardly any reason for that. Delta-lake is fast and it’s pretty much equivalent to operation with a parquet extention.

An optimum schema design will provide the biggest performance ga

– Spend some time on designing your schema. The optimum design schema will take you a long way into designing and performance application. So the tourists can spend some time over there. We used Dataset API to the large degree than as compared to a data frame API. The regions are we are willing to, complicated logic and we want compile time safety. Our team is good with Spark and Scala and Java code. So not everybody needs to know Spark syntax. They can do eternal code in Java and Scala syntax, if you’re using Data set API and then if you have any usable library, that you can just use, but one downside of it is an API that is a little bit slower as compared to compared to Data frame because it has to you know, easily analyze the data.

So sometimes we use actually a combination of both. So when it is business logic heavy, we use data set, but when you’re accessing data, very convenient.

Be cognizant about the whole architecture of the whole data pipeline. You might only be motiVating one piece. You will be able to bottle neck from that, but at the same time , you might be leaving the other person. So, when you promise for scalability and performance, consider the whole pipeline. Consider that you are hitting one, so, after that, is the waterline moving to something else. So, consider all that, before making an expectation. We’re happy to go around with open source Spark. So now I will not be scared to use it for production grade application, for enterprise application. Just go for it. It is a pretty stable application, pretty stable, software and framework. Even though you do not have those V’s, TVs or six V’s or big data application, you still be okay using a Spark data analyser. It will not disappoint you.

Thank you very much for attending my session. Please provide me your feedback and you can reach me to my email,

Watch more Spark + AI sessions here
Try Databricks for free
« back
About Mohammed Salim Sayed


I have been developing software since 2001,at present I am working for Optum Inc on a medical claim processing system. The healthcare industry being old is full of monoliths. I have been helping to revamp these applications to achieve simplicity, integrability, scalability and performance so that development and maintenance of the software is painless and running cost is proportionate to its earning.On the personal front I care a lot about environment, equal opportunities and human rights.I wish to build a compassionate word through software. My Stack Overflow profile -