Gaining insights and knowledge from real-world health data (RWD), i.e., data acquired outside the context of randomized clinical trials, has been an area of continued opportunity for pharma organizations.
What is real-world data and real-world evidence – how it is generated, what value it drives for life sciences in general and what kind of analytics are performed.
What are some considerations and challenges related to data security, privacy, and industrialization of a big data platform hosted in the cloud.
How we leveraged Databricks to perform big data ingestion, and its advantages over native AWS Batch/Glue.
Examples of some of the advanced analytics use cases downstream that leveraged Databricks for RWE.
Note: This solution and one of the use cases leveraging the solution won the 2020 Gartner Eye for Innovation award.
In this session watch:
Harini Gopalakrishnan, Director, Sanofi
Martin Longpre, Architect, Sanofi
Harini Gopalakr…: Hi everyone. My name is Harini Gopalakrishnan. I’m at Sanofi, and I lead the technology strategy for what we call evidence generation and insights analytics. Databricks is a key component of the ecosystem that we have built, and I’m happy to be sharing that with you today, along with Martin Longpre, who is the solution architect who helped build [inaudible]. The topic itself is called Real World Evidence and Patient Analytics. It’s a use case that we at Sanofi take great pride in executing. The first part of my talk is primarily going to be about explaining the context around what we mean by real world evidence and data and why it is important for life sciences, and then we switch over to some of the technical elements of how we repurposed Databricks to achieve these goals.
So, it’s going to be a quick presentation. We won’t overdose you with a lot of slides. And as you can see in the agenda, it’s going to be quite focused, and we will leave a lot of time for Q and A. But I hope that at the end of the session, when you leave, you understand what we mean by real world evidence and data, why privacy and security are important, how we had to customize Databricks to achieve some of these goals, and what we are looking forward to in the journey and the partnership together.
So, let’s define the problem. What is real world data and evidence and why is it important for us, right? Essentially real world data, or RWD as we will call it from now on, means all the data related to patients and healthcare that are not collected as part of what we normally call a clinical trial, which is generally called an RCT or a randomized clinical trial. So, potentially it’s all the data about a patient in the universe that is out there for us to tap into.
The very popular ones that we often use are electronic health records, which come from most of our visits to hospitals, or whatever the doctor documents, and then claims, which are basically the medical claims that you file with your insurance provider. In the future, we will include data from devices, your Apple Watches, weather apps or social media. So, it’s a very diverse and growing space. Anything that can help inform us about how the disease is progressing in the community or how patients deal with it is useful for us to mine insights from. And then real world evidence, or RWE, is essentially doing analytics on this diverse data to infer a certain insight, or what we call evidence.
Today in our company, we have about 130 terabytes of EHR and claims data, and with the multiple versions and transformations that we produce, it adds up to about 2,000 terabytes of data that we manage in the versions and analytics platforms that we control. So, it’s a huge amount of data that we deal with. It’s also unstructured because the kind of problems we’re mining for is also kind of theory. What does analysis in real world evidence really mean, right? We said we have a lot of data, so what kind of analysis should we do with it? There are typically two kinds of analytics. The first is what we call conventional, which is your regular statistics code that can be canned into a program. Traditionally, they use SAS, the programming language, to manage it. Or it can be advanced analytics, meaning you’re using AI or machine learning to predict things.
And this is where our implementation makes the most sense. The typical advanced analytics methodologies include all the ones that you see on the screen, essentially supervised and unsupervised learning examples. And the data that we bring in, depending upon the need and the use case that we mine the insight for, is cut to purpose, or what we call fit-for-purpose data. And this we typically call the analytical cohorts. How do we use this in RWE? Where in the value chain can we use real world evidence? And why is it important for a pharmaceutical company? Typically, RWE can influence a lot of aspects of the pharmaceutical life cycle. It can go from research to approval of drugs to even helping commercial or [inaudible]. It helps, for example, to find new indications, what we would call new disease areas that you hadn’t thought of: just by looking at the nature of the data and doing some machine learning on it, you find hidden trends or patterns that you couldn’t discover before.
We can also find the effectiveness of a drug in the real world or market. Remember, a clinical trial is normally done in a controlled setting. The number of patients in some cases is limited. Whereas in the real world, when you look at the gamut of insurance claims or EHR, we have a much larger patient population. So, we can actually look at the performance of a drug in the market much better by looking at various parameters coming from this data. And it can also expand labels, which is very important for the company. For example, you have launched a drug for a certain set of approved indications. If you are able to find new indications, the data can also help with expanding your label for these other unmet needs. So, for a company like Sanofi, or any pharma, it’s actually a pretty important use case.
A real-life example would be Pfizer, which was in the news a couple of years ago. They were able to get the label of a drug called Ibrance extended to breast cancer in men, and they used only real world data without having to start another clinical trial. Because they were able to do that by looking at data that already exists and not launch another costly trial, it not only provides a cost benefit, but also, in a case like breast cancer in men, which is a very, very rare condition, you’re not trying to recruit patients who are very difficult to find, because there is not a large patient population suffering from this disease. So, it also helps in this case to manage this rare disease [inaudible]. Of course, the key driver for this is generally high quality data.
And that’s why, like in any other industry today, the data providers play a key role in our ecosystem, even from an investment standpoint. As they say, data is the new oil, and to manage such high quality data and allow people to do compute with clear security and traceability that external partners and regulators can trust, we need a platform that has certain levers and pillars, and that’s what we have built at Sanofi today.
Great. So, let’s actually look at a use case where we really applied this in a real life setting at Sanofi, if I could use the term real again. What you see is an indication finding use case that we based on machine learning techniques. We built an industrialized pipeline to do data engineering, to actually analyze the data, and to find what we call new insights. The data itself was varied, as described in the slide.
It’s a combination of insurance claims, data coming from labs, and Rx, which typically means our pharmacy information. We then cluster this into patient groups of a similar population. What we cluster on is in fact the secret sauce, so that’s why it’s not revealed, but essentially the pattern helps group patients with a similar profile together. And then in the end, we’re able to link it to a therapeutic area. What we essentially mean is that each cluster helps us identify a new indication that the patient population could benefit from, which wasn’t very obvious to the naked eye or to traditional analytics in the past. To do this, there are different constructs. You could ask, “can’t we just do it in any machine learning platform?” We will get into that a bit later, but there is an aspect of privacy preservation and de-identification. When we do this, remember we’re doing it at a patient population level.
We don’t know who the patient is. And this is something of an ethical compliance that we adhere to, and to do all of this at an industrialized scale is where we have built this ecosystem. And we also leverage Databricks as one of the components. On the right you’ll see this use case in more detail, on what exactly we tried to achieve. The million patients that we analyzed and the 2,700 characteristics we were able to mine into. But what’s very important is this particular submission or the end point of this result was submitted last year for a Gartner award. And we ended up winning it and Gartner in fact called our ecosystem and the use case that came with the ecosystem as innovative use of an emerging technology within life science and healthcare.
So, it’s really a pretty good success story for us. We have been able to achieve this at scale because we are now able to reproduce these assets for other kinds of data and other kinds of indications. It is not a standalone pipeline either. And so these are some of the elements that have gone into building this platform and the journey that has shaped it for the last two years.
And one important point before we get into the details: because we deal with data that is coming from a variety of sources, but is always linked to a disease or a patient, we have to take care in making sure that we do not use it without their consent and that it is not possible to identify the individual. So, we take a lot of pains in managing the data in a secure and traceable way and in ensuring that governance is behind it. The important theme of this presentation, the slide and the platform is always privacy preservation. The data should not be used beyond the intended purpose, and the governance of this usage is a must. A lot of effort has gone into implementing that. And this is also where our partnership with Databricks is hopefully going to take us forward in some other dimensions.
So, what’s this architecture and implementation that we talked about a lot previously, right? We set the stage on what the use case is, why it’s important for pharma and what we achieve with it. So, let’s see what’s behind the scenes. Any evidence generation ecosystem should have four pillars, right? The first pillar that we would like to mention here is data management. That doesn’t just mean that we do an ELT or an ETL on the data to bring it from the sources into a cloud. We need to put good practices in place for keeping the pipeline agile. We have frequent refreshes of data, so we need to make sure that version control is in place. And at the same time, we need to be able to code in a dynamic language, Python or even in some cases Java, so that the many data engineers we have can tap into this raw data and create their own engineering pipelines.
The second is a component of analytics. As you could see, there is a diverse set of stakeholders. There are people who prefer very traditional statistics, like R or SAS, and we have people who would like to work in [inaudible]. So, we need to address both of these kinds of users in the ecosystem we built, keeping the security around the data uniform. The third is access control. Because a lot of the data is licensed or we buy it, there is a very important element of us knowing who’s accessing which data and whether they are doing with the data what they said they would do, and making sure that at any point in time, we can reproduce the analysis that was done. So, a lot of the implementation and customization of the platforms has gone into ensuring strict access controls on the data itself.
And the last is auditing and monitoring. It’s not enough to control the data; we also need to control the transformations, the derived data sets and the publications coming out of it. So, it’s a full end-to-end lineage where, at any point in time, we can not only see the raw data, but we can also control the data access, the transformations into the right data sets, and finally the dashboards or any other form of analytic output that comes from these insights.
And what does this really offer? What we have built is a powerful compute that, as you could see, can handle billions of rows of data today. We have a complete history of all data updates. We can go back to any version of the data. We have good traceability, and we can transform and capture the datasets. We have robust security. And then we’re also able to manage metadata and reference data for the particular population that we are dealing with. It’s completely on the cloud; we’ve built it on a scalable data lake, which today is based on AWS. And this is the architecture, and I will be passing it off to Martin in a second. But essentially the message here is that, as you can see, it’s a complex ecosystem. We deal with different kinds of platforms and tools because we have different kinds of users that access it.
We have people who just consume dashboards. So, we have traditional BI and full-stack dashboards. We have scientists who would like to do traditional studies, what I referred to as conventional statistics. So, we have a platform that helps them do that without having to write code and transformations in Python. And then we have our big set of advanced analytics data scientists, which is where Databricks operates the most. Here we help users, with a good CI/CD, bring their own models and their own analyses, and at the same time make sure that we have the lineage and the view that I talked about. All of this is backed by a central data lake with permissions. We also integrate with other internal and external systems, and we have a lot of external partners who also work on the system. That being said, Martin, over to you, to help deep dive into this technology and show what we did with Databricks.
Martin Longpre: Thank you, Harini. Hi everyone. My name is Martin Longpre. I’m the domain architect managing the data engineering and cloud expert team for the real world ecosystem. So, for the next 15 minutes I will detail our real world ecosystem infrastructure, hosted on AWS, and more particularly the integration of Databricks to support this real world ecosystem. This real world journey started more than three years ago, and the main drivers were to follow and reuse, as much as possible, the Sanofi standards and the AWS cloud values, and to leverage automation such as infrastructure as code [inaudible].
Our data lake, as shown here, is split into four different zones for a better ingestion and security process. We have a transient, a raw, a trusted and a refined zone, and the data transitions from zone to zone depending on each data type and usage. When the data is ingested, checked and cleaned, we then push it to our SaaS partner or make it available for our internal analytics and visualization applications. By the way, we’re not providing any direct access to the data lake, which is based on AWS S3 buckets. All our data, mostly in flat files using Parquet and CSV, is governed by the real world access governance.
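As a rough illustration of the zone-to-zone flow described here, the following Python sketch models the four zones and a promotion step. The zone names come from the talk; the `promote` function, the `checks_passed` flag, and the rule that data advances one zone at a time are assumptions made for illustration, not the pipeline actually used.

```python
from enum import IntEnum


class Zone(IntEnum):
    """The four data lake zones named in the talk, in promotion order."""
    TRANSIENT = 0
    RAW = 1
    TRUSTED = 2
    REFINED = 3


def promote(current: Zone, checks_passed: bool) -> Zone:
    """Return the zone a dataset moves to next (hypothetical rule).

    Assumption: a dataset advances one zone at a time, and only when its
    ingestion checks passed; otherwise it stays where it is.
    """
    if not checks_passed or current is Zone.REFINED:
        return current
    return Zone(current + 1)


if __name__ == "__main__":
    zone = Zone.TRANSIENT
    # Simulate four pipeline runs, one of which fails its checks.
    for ok in (True, True, False, True):
        zone = promote(zone, ok)
        print(zone.name)
```

In practice the promotion rule would depend on the data type and usage, as the talk notes; the single linear path here is only the simplest case.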
The only way to access this data is from an analytics or visualization tool using our authentication service, based on AD groups and assumed AWS IAM roles. Finally, all our data lake data used to be ingested using mostly AWS [inaudible] services like AWS Glue, AWS Batch and other services. But for the last five or six months, we have been migrating all those pipelines into Databricks. Before going deeper on the next slide, let’s focus on our Databricks integration. Today we have four different Databricks workspaces. We have two sandboxes, one in the US and one in EMEA, and two production workspaces, also one in the US and one in EMEA. We have regional Databricks workspaces to respect data licensing based on the contracts we have with our different data providers, or in case any data has regional restrictions. Today, with those two main regions, we are covering all our needs, and we are waiting for a solid use case to deploy the APAC region as well.
Since we have used Terraform to deploy mostly 90 to 95% of our Databricks workspaces, a deployment in any AWS region is quite easy now. So, back to the slide. Where do we use Databricks? First, for exploratory use cases, where our different data scientists need to run AI or ML workflows or use cases that require GPUs, custom libraries, and more. We also use Databricks for cross-functional team projects, shared between internal and external stakeholders. For flexibility, we let users manage their own clusters, sizing up and down based on their needs, and they’re restricted by our cluster policies. By the way, we didn’t open all the cluster modes, worker types and driver types for now, for better budget management. And last but not least, for data ingestion pipelines, mainly managed by my team. By migrating from AWS native services to Databricks, we have estimated a minimum of 30% improvement in cost and productivity.
So, since last month, we’re also evaluating the usage of the Delta format, since today we are using mostly Parquet files. We think Delta should bring us great added value, but we’re still evaluating that with Databricks to make sure that the format can be made easily accessible to all our other consuming systems. We also made the SQL Analytics feature available on our sandbox workspace, to be evaluated by one of our business teams who are looking for quick ad hoc SQL queries on our data lake. So, on the next two slides I will share with you the customizations we applied for our different governance requirements. Those Databricks customizations are evolving [inaudible], following new Databricks features or use case requirements.
The first one is for security. Passthrough was the main priority for us, to easily and automatically manage our data access and project restrictions, now using AD groups for data access and Databricks groups for projects. Those groups are temporary. Our projects have a starting date and an ending date, and at the end of the project we delete all the different Databricks groups. It’s easier to manage like that than going back to Azure AD and requesting the deletion of the AD groups. Most of the DBFS paths are not available for Databricks users, except the different paths needed for library installation and cluster init scripts, and only for specific file extensions. We are allowing the [inaudible] we are controlling as much as we can, the DBFS access system. The upload and download feature is also totally disabled based on our data restriction policies. We are providing an internal file transfer tool when it’s needed. And we are also providing automatic mount points for read-only data and project storage access.
Then GitLab integration. Our GitLab enterprise integration was mandatory for my team at the data engineering level and also for our data science users, to process their CI/CD pipelines and code versioning. For this requirement, what we did was implement the Databricks Repos feature by using a permanent Git proxy cluster. We hope that Databricks will provide us a more robust feature in the next release to better fulfill this requirement, instead of having a twenty-four hours [inaudible] thing. We are also using cluster policies for all projects and users. The main goal of those policies is to provide better audit and monitoring KPIs for each project by using the cluster policy name suffix parameter, meaning that to use a custom cluster, the user needs to be in a specific Databricks project group and apply the suffix naming in the cluster name.
Those policies let us limit the cluster mode, worker type and driver type for better service usage and budget management, and enforce specific parameters like cluster termination, passthrough activation, and more. Then, to end with the Databricks customization: why are we not using the instance profile feature on a regular basis? Since we have almost 500 users registered on both workspaces, setting up an instance profile per project or per user would be a huge effort, relying mostly on human action. Passthrough covers almost 90 to 95% of our requirements in an automatic way. So, we use instance profiles only for specific requirements. In those cases, the cluster is set up for only one user, using the single node or standard cluster mode, by the project owner or by my team directly.
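To make the cluster policy idea more concrete, here is a minimal Python sketch that builds a policy definition in the attribute-to-constraint JSON shape that Databricks cluster policies use (`fixed`, `allowlist`, `regex`, `range`). The project name, instance types, limits and the passthrough Spark conf key are illustrative assumptions; the talk does not give the real values.

```python
import json
import re

# Hypothetical project name (no regex metacharacters); not from the talk.
PROJECT = "custom-projectx"

# Sketch of a policy definition: attribute path -> constraint.
# Every value below is an illustrative assumption.
policy = {
    # Cluster name must end with the project suffix.
    "cluster_name": {"type": "regex", "pattern": f".*-{PROJECT}$"},
    # Force idle clusters to terminate after a fixed number of minutes.
    "autotermination_minutes": {"type": "fixed", "value": 60},
    # Restrict worker and driver instance types for budget control.
    "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
    "driver_node_type_id": {"type": "allowlist", "values": ["m5.xlarge"]},
    # Cap the number of workers.
    "num_workers": {"type": "range", "maxValue": 8},
    # Assumed Spark conf key for enforcing credential passthrough.
    "spark_conf.spark.databricks.passthrough.enabled": {
        "type": "fixed",
        "value": "true",
    },
}


def name_allowed(cluster_name: str) -> bool:
    """Check a proposed cluster name against the policy's name pattern."""
    pattern = policy["cluster_name"]["pattern"]
    return re.fullmatch(pattern, cluster_name) is not None


if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
    print(name_allowed("etl-custom-projectx"))
    print(name_allowed("my-cluster"))
```

Tying the name pattern to the project is what would let audit and monitoring KPIs be grouped per project, as described above; membership in the matching Databricks project group would be enforced separately through permissions on the policy.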
So, now I will show you more of a quick tour than a demo of Databricks user access, to show you the different automatic mount points for folder access plus other things I’ve already shared during this slide deck. So, let’s start this quick tour of our Databricks workspace. I use the EMEA region one. As you see, we are using the single sign-on feature. We also keep the admin login, so if something happens with Azure AD, at least we can connect. With the single sign-on we are connecting, and you are landing on the Databricks workspace landing page. We have all the menus on the left side. As I already said, you won’t see the SQL Analytics feature over there. It’s only installed on our sandbox, so I’ll show it to you on sandbox. On sandbox we’re not using the SSO and we are not using our sole [inaudible].
So, we have one big instance profile where the data is also over there, so users can test all the different features and other stuff like this. If you go to the bottom here, you will see all the different menus, with the SQL Analytics one. We have one business team who are testing that new feature, and we’ll get back to that feature in the near future, I think. Then I go back to the production one and will show you the cluster setup. As I already said, we have one GitLab cluster set up per workspace. This is the only one that provides access to our GitLab enterprise system. So, I will show you how to connect to it. I will use a dummy repo. I might not have changed my token in the last couple of weeks.
It might not work, but at least I already have a repo within my workspace. We are really connected and using that GitLab repo to get all the different data from our GitLab repos and make it available in Databricks. Now I go to the cluster policies. As you see, we have a lot of cluster policies in our system, and each project has a custom one; there will be a policy for each project. And we have the user cluster policy. This is for the regular user. So, we are setting up the number of workers, the termination minutes, and each different node type and driver type available on the system. So, we don’t open the platform to all the different capabilities. We have passthrough enabled, the IP tables, and some of the AWS roles and metadata needed for the different connections.
Then we see that the major difference between the custom and regular policies is the cluster name pattern. A user needs to use the policy name in the cluster name to be able to get the different custom setup and the different IP tables. Now I will show the different users and groups, to show the difference between the data groups and the project groups. I think there are almost 500 users on the platform. If you go to the groups menu, you will see that all the APP groups are Azure AD groups. Those are our permanent groups, used to access the different data in our [inaudible] in read-only mode. Those were the APP groups. The other groups are Databricks groups, starting with custom. Those are used to put the different data scientists in groups, to add different customizations, and to let them put their derived data in whatever kind or type of file they need for their different projects. And here we get the almost 500 users: 493.
We are also using the jobs module, and you see, we already have some digital CRM ingestion pipelines in place in Databricks. So, we are migrating our pipelines step-by-step [inaudible] services. Our data scientists use that too. The tables menu is not used by our data scientists as of now, so we don’t have any Databricks tables created. So, I will go with the quick demo to show the different mounts, because all the different storage accesses are done automatically using the slash [inaudible]. When a user comes into the platform, they can list all the different mounts available in the platform. We have data mounts, we have project mounts, and you also have the home user one. When the user connects, we automatically create a specific folder for them on the S3 bucket, so they are able to access it within slash home, with their email.
So, then it will list all the different email home folders we have over there. I will find mine here [inaudible]. So, I will list my home folder. I should have one or two folders over there. That’s it: I have the output data and the SFTP upload data. But the main difference is that those home folders are accessible only by their user, so only one user can access each one. They can’t share anything from that folder, compared to the real world data one. So, now I can list the [inaudible] folder. This is in read-only mode, so I can find the different releases we receive from CGM, and I will list the different tables available for that specific release. We have all the different tables over there. And for each path you will find the Parquet files under the Parquet table, with the code lists. And then we see the Snappy Parquet files. It was a quick tour to show you the different ways our data scientists can access the data and the restrictions we’re applying.
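The folder rules shown in this tour can be summarized in a small Python sketch. It assumes a simplified layout in which the first path segment is the area (`home`, `data` or `project`) and the second segment is either the owner's email or the name of the group guarding the folder. That layout, the group names and the `allowed` function are hypothetical; the real mount root and group names are not given in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """A workspace user with the groups they belong to (illustrative)."""
    email: str
    groups: set = field(default_factory=set)


def allowed(user: User, path: str, write: bool = False) -> bool:
    """Decide whether a user may read (or write) a path.

    Assumed rules, following the talk:
    - home folders are accessible only by their owner;
    - data folders are read-only, for members of the guarding group;
    - project folders are readable and writable by project group members.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        return False
    area, owner = parts[0], parts[1]
    if area == "home":
        return owner == user.email
    if area == "data":
        return not write and owner in user.groups
    if area == "project":
        return owner in user.groups
    return False


if __name__ == "__main__":
    alice = User("alice@example.com", {"APP-claims", "custom-projectx"})
    print(allowed(alice, "/home/alice@example.com/output", write=True))
    print(allowed(alice, "/home/bob@example.com/output"))
    print(allowed(alice, "/data/APP-claims/release1", write=True))
```

On the platform itself these rules are enforced by passthrough and the underlying storage permissions rather than by application code; the function only restates the intended behavior.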
Thank you. So, last slide to end this technical part of the presentation. We are working closely with Databricks to get continuous improvement on our Databricks stack. The first one is RStudio. We have between 20 and 30% of our data scientists using RStudio, but since the storage in RStudio is [inaudible], maybe EFS could provide a solution for that specific request, and since we need to use only instance profiles for project and data access, we are mostly refusing access for now and open it only for specific use cases, managed totally by my team.
The second most important improvement request we are working on with Databricks is the propagation of access rights on derived data, and data lineage. One main goal is to have the same access on data anywhere in Databricks workspace storage. Today, data access is managed using passthrough. That’s totally perfect for data access in read-only mode, but as soon as a data scientist or a user derives some big tables and copies them to another storage place, like the project mount, the permissions are lost and everyone who can access the storage can access the data.
Databricks tables, we think, might bring some solution for this requirement, but not totally, and our data scientists are not using Databricks tables for now, as you saw in the quick demonstration. For data lineage, we are looking for a way to have all actions [inaudible] in each Databricks workflow, a kind of data science journey where we could see the data sources used, data prep, transformation, versioning, and output, so we can better manage each pipeline end-to-end in a kind of visual way. Maybe other improvements will be raised while using Databricks more and more, because we have been using Databricks for only five or six months, but as of today, we are working more in a kind of partnership with Databricks [inaudible]. This is much appreciated. But that’s it for me. So, over to you, Harini. Thanks a lot.
Harini Gopalakr…: Thank you, Martin. And just as a wrap-up of where we are today: as Martin said earlier, we are in a partnership with Databricks. It is a journey that’s going to continue. We have valid requirements that we would expect from the product and that today we satisfy outside the product. And there are things that are already working really well for our data scientists. But just as a recap, we started this journey three years ago, when it was a traditional warehouse, and we moved that into a big data lake ecosystem managed by a couple of different cloud providers and SaaS partners. Databricks is a new entrant that we hope to work more with. We have helped move away from conventional analytics to more advanced analytics approaches, which leverage the true value of cloud and big data. So, it’s a huge change management effort as well.
And in this journey, because of where we are today, we are able to generate what we call evidence at scale and in an industrialized manner. And one of the pinnacles of achievement is the external recognition that we get from agencies like Gartner. So, it’s been an interesting ride so far, and we hope to continue it in the future. As we wrap up, I hope you have taken at least a couple of points from our interaction, and we hope to see you back next year. And just as an ending note, all the points expressed are our individual views. They don’t represent the Sanofi position, but they are something that we have learned and wanted to share with this set of Databricks enthusiasts. Thank you.
"Harini leads Real World Evidence and Insights- Technology Product in Sanofi and oversaw the implementation of Sanofi’s RWE Strategy into an end to end big data analytics platform on the cloud. Sh...
Martin Longpre is the Medical solution architect who engineered the data flow on the cloud along with the integration of various components. He is a Computer Engineer and has led the implementation of...