Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA

May 28, 2021 10:30 AM (PT)


EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. On 27/03/2021, a new EU regulation (EU 2019/1381) came into effect, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens.

To comply with this new regulation, delaware has been supporting EFSA in undergoing a large Digital Transformation program. We have been designing and rolling-out a modern data platform running on Azure and powered by Databricks. This platform acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned w.r.t. changing processes and data models. At the heart of the platform lie two important patterns:
 
1. An Event Driven Architecture (EDA): enabling an extremely loosely coupled system landscape. By centrally brokering events in near real-time, consumer applications can react immediately to events from producer applications as they occur. Event producers are decoupled from consumers via a publish/subscribe mechanism.

2. A central data store built around a lakehouse architecture. The lakehouse collects, organizes and serves data across all stages of the data processing cycle, all data types and all data volumes. Event streams from the EDA layer feed into the store as curated data blocks and are complemented by other sources. This store in turn feeds into APIs, reporting and applications, including the new Open EFSA portal: a public website developed by delaware hosting all relevant scientific data, updated in near real-time.

At delaware we are very excited about this project and proud of what we have achieved with EFSA so far.

In this session watch:
Sebastiaan Leysen, Data Platform Lead, delaware
David De Wilde, Data Science & Engineering Consultant, Delaware Consulting CVBA

 

Transcript

Sebastiaan Leys…: Hi everyone. My name is Sebastiaan Leysen, and together with my colleague, David De Wilde, I will be presenting the project that we have been working on at EFSA over the past year, where we have been, in essence, laying the foundation of an intelligent, event driven data platform. Both David and myself work at Delaware, which is an IT consulting organization with firm roots in Belgium, but with a growing global presence as well, as you can see on the right of the screen. On this project I took on the role of data platform lead, while David took on the role of technical solution architect. So, David will be taking up the second part, which will be the deep dive into the Databricks solution. Now, just for the overview, the key objective of this presentation is to share with you our perspective and lessons learned on how to install and use Databricks.
It’s a very powerful platform, we can all agree on that. It covers a lot of analytical and data science use cases, but of course you have to make good architectural decisions upfront on how to organize your storage and how to organize your flows. We just want to share with you the insights and best practices that we gathered over the past year at EFSA. Before we do that, I will first give you an overview of who EFSA is, and a bit of context on the project.
EFSA is the European Food Safety Authority. So, it’s the European agency that is responsible for providing independent, scientific advice on everything related to food and feed safety in Europe. Of course, they work together with a lot of other national authorities and stakeholders, but if you need to summarize EFSA in a nutshell, they’re basically responsible for making sure that everything that European citizens eat is considered safe. Now, to understand the role and position of EFSA in this process, you have to understand the distinction between what we call risk assessment on the one hand, versus risk management on the other hand.
Risk assessment is the role that EFSA takes on for everything related to the food and feed chain. So, it’s providing independent scientific advice, and for that they need to collect and analyze a lot of research data. On the other hand, you have risk management, as you can see on the right, which is typically done by the European Commission, the European Parliament and other stakeholders. They are responsible for making policy and for making informed decisions, and in this case they do that for everything related to food and feed safety, with the information that EFSA consolidates and provides to them.
Now, in the next few slides I will give you an overview of the core risk assessment process at EFSA, because I think it’s important for understanding the context of the project. There are three main steps. The first step is the receipt of the request: as you can see on the left here, you have the three typical European bodies, and they mandate EFSA to work on a particular risk assessment. So, they are basically saying you can free up time and resources to work on a particular scientific advice. So, the mandate comes in, and then you have to know that EFSA is organized around 10 panels of independent, external scientific experts, and each of these panels is dedicated to a particular area within the food and feed chain. Attached to each panel there are also working groups of people who do the actual work, but it’s the panel that oversees the overall process.
So, when a mandate comes in, let’s say it is, in this case, assigned to the food contact materials panel. That’s the first part, the receipt of the request. The second part of the process is the actual risk assessment itself. And as you can see here, a lot of data is used to perform such an assessment. You have the scientific evidence, which typically comes in via dossiers that applicants, often big organizations, submit to EFSA, containing a lot of documents and information describing those documents.
Of course, you have the expertise of the experts themselves. You have the literature studies, but also the public opinion. Sometimes EFSA uses public consultations to ask feedback from the public. All of that information is used to eventually come up with the scientific advice for a particular topic. And that’s what you can see on the right. The outcome of a risk assessment is called the opinion, or also the output. Initially it’s a draft, and there are a lot of meetings and feedback rounds, but eventually, when the output is final and the assessment is concluded, the output is adopted by the panel, in this case the food contact materials panel.
Now, the third and last step is the adoption and publishing of the outputs. The opinion is shared back with the original requesters, in this case the European Commission or the European Parliament, for example, so that they can make legislation changes and informed decisions based on the outcome of the risk assessment. In parallel, the opinions are also published in the EFSA Journal, which is a free public website hosting all the scientific papers that EFSA publishes. So, that’s in a nutshell what EFSA does, in essence. Now, you have to know that in 2019 an interesting opportunity, I would say, came by. The European Parliament voted a new regulation, which basically imposes on EFSA to become much more transparent. As part of the regulation, there were objectives for EFSA to comply with. In essence, they were demanding that you and I have a better view on what is happening to come to the outputs or the opinions.
So, what are the pieces of data that are being used? How does the decision process happen, to come to the opinion? What meetings took place, et cetera. So, all the pieces that jointly constitute the opinion published by EFSA: how did it come into place? And EFSA basically took it as an opportunity to review their internal processes, as well as their IT landscape, and to embark on a bigger digital transformation to comply with the points that were in the regulation. That is, of course, also where Delaware came in, to support EFSA as a partner in this journey. Then I come to the second part of the presentation, which is the high-level solution architecture. So, to help EFSA comply, you can see that we worked on three main pillars.
The first pillar is the data platform. It’s a modern data platform which is, in essence, collecting and consolidating all pieces of data that surround the risk assessment, both structured and unstructured data, big data volumes and small data sets. So, we have been thinking, designing and implementing with them a modern data platform to host all that information. Secondly, the integration pillar, which enables all the applications within EFSA that take up a part of the whole risk assessment process to communicate with each other in real time, using what we call the event-driven architecture paradigm. It installs a central event broker, which is brokering the information between applications, following a publish/subscribe mechanism. And thirdly, we have the Open EFSA portal, which is a public website that, in essence, informs the public on everything related to the risk assessment. So, where we are in the life cycle, what pieces of data were being used, what experts did what, et cetera.
So, it recently went live, and it is making an enormous amount of data available to the public. In essence, it’s enabling public scrutiny of EFSA’s work. Now, the best way for me to explain what we have been doing is by comparing it to what happens at an airport. What you can see here is a control tower. If I think of a control tower at an airport, it takes up several activities, like enabling the communication between airplanes, orchestrating the ground processes, validating the flight plans, monitoring all types of things, tracking the airplane movements in real time, scheduling arrivals and departures, et cetera.
Now, if I think of a data platform and I compare it to what happens at an airport, this is how I would think of it. The data platform is responsible for brokering information between applications, ideally in real time, orchestrating data movements, but also validating data, as well as monitoring the health of the platform and the data flows, harmonizing and consolidating the data, as I said before, and also feeding the data to other parties and other applications, both internally and potentially also externally.
So, building a bit further on that analogy: if you think of the applications that are part of the whole risk assessment process at EFSA as the planes flying around in the air surrounding a control tower, the information that they share is basically brokered by the control tower. That’s the first part, real-time, event-driven communication. But secondly, all that communication we route, or we channel, into a central knowledge base, a central data store, which is curating the data, harmonizing the data, and making sure that it can easily be fed to other applications. Now, if we translate it into the EFSA use case, this is the picture that we ended up with. So, on the top, you can see the event-driven architecture and the main applications, or a subset of the applications, I should say, that are involved in EFSA’s risk assessment process.
And what is happening at the top is that they are sharing events in real time, whenever something happens. So, let’s say a mandate comes in and the mandate is registered in Salesforce: an event is triggered and other parties or systems are informed automatically. All those events are also channeled, as I said, into a central data store, which is running on Azure, and all the transformations that take place on that data store are enabled by Databricks, together with Delta Lake. And all the information, the central truth that we are building there, is then feeding into a variety of consumption use cases, as I call them, of which you can see a few below. So for example, the Open EFSA portal, which is a public website, but also an API portal, which I’ll come to in a second, as well as, for example, an internal [inaudible] application, and of course EFSA management reporting, so that they are also aware at a high level of what is going on.
Now, the next few slides give you an example of the deliverables that we have been building. This is an example of the Open EFSA portal, and you can also find the link below here in the slide, if you want to navigate to it yourself. Everything that you see on this website is generated by the platform and kept in sync automatically, based on the events and the data we get from the other systems. So, let’s say that the output, or the opinion, is adopted: that’s an event that is triggered from Appian, which is the application managing the life cycle of the risk assessment. The event is forwarded to Azure, and it is automatically propagated to the Open EFSA portal, and it will update the timeline in this case, as you can see. So, that’s one example. Everything, as I said, is updated by the data flow underneath, and also what we call the dissemination rules, the rules determining when the information should be publicly disclosed, are all implemented using that same framework.
Secondly, what I included here is a screenshot of our API developer portal, which enables EFSA to give other developers access to the risk assessment information. What you see here is a RESTful API, called GetMandates, which enables other developers to retrieve the data in the Azure data store. So, it’s basically a data-as-a-service philosophy, in which of course the APIs are secured. But once you, as a developer, are given access, you can use the data in your own application.
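To make the data-as-a-service idea concrete, here is a minimal sketch of how a developer might call such an API once granted access. The host, path, query parameters and the subscription-key header are illustrative assumptions, not the actual EFSA API contract.

```python
# Hypothetical example of consuming a "GetMandates"-style REST API.
# The base URL, path, query parameters and auth header are illustrative only.
from typing import List
import requests

BASE_URL = "https://api.example-efsa-portal.eu"  # placeholder, not the real host
API_KEY = "<your-subscription-key>"              # issued via the developer portal

def get_mandates(page: int = 1, page_size: int = 50) -> List[dict]:
    """Fetch one page of mandates from the (hypothetical) data-as-a-service API."""
    response = requests.get(
        f"{BASE_URL}/risk-assessment/v1/mandates",
        headers={"Ocp-Apim-Subscription-Key": API_KEY},  # typical Azure API Management header
        params={"page": page, "pageSize": page_size},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for mandate in get_mandates(page=1):
        print(mandate.get("mandateNumber"), mandate.get("title"))
```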
Then, if we zoom in a bit deeper into the technical components that make it all work, this image basically summarizes the generic pattern that we installed. So on the top, you can see the equivalent, or the machinery, if I may say, of the control tower, whereas below you can see the implementation and the components that we used for implementing the central data store. As can be seen, it’s running on Azure Data Lake Storage Gen2, and all the transformations are powered by Databricks, which we’ll discuss later on. But let’s take an example. Let’s say that a mandate comes in, which is registered in Salesforce.
So, at the top left you have the application component in this diagram, let’s say it’s Salesforce. When the mandate is created, it will publish a “mandate created” event, using the endpoint that is registered in API Management. The event is sent to Azure, and then it walks through a series of components to validate certain details, like the header of the event and the payload, as well as to make sure that the event is routed to the registered subscribers, that’s the publish/subscribe mechanism, so that other systems are automatically updated in real time when something happens.
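As a schematic illustration of this validation and publish/subscribe routing, here is a minimal Python sketch. The event fields, event types and subscribers are invented for illustration and merely stand in for the actual Azure components in the platform.

```python
# Schematic sketch of event validation and publish/subscribe routing.
# Field names, event types and subscribers are illustrative assumptions.
from typing import Callable, Dict, List

# Subscriptions: event type -> callbacks standing in for downstream systems.
SUBSCRIPTIONS: Dict[str, List[Callable[[dict], None]]] = {
    "mandate.created": [
        lambda e: print("notify workflow system:", e["payload"]["mandateId"]),
        lambda e: print("queue event for the data lake batch:", e["payload"]["mandateId"]),
    ],
}

REQUIRED_HEADER_FIELDS = {"eventType", "eventId", "timestamp", "source"}

def validate(event: dict) -> None:
    """Check the event header and payload before routing (simplified)."""
    missing = REQUIRED_HEADER_FIELDS - event.keys()
    if missing:
        raise ValueError(f"invalid event header, missing fields: {missing}")
    if not isinstance(event.get("payload"), dict):
        raise ValueError("invalid event payload")

def route(event: dict) -> None:
    """Deliver the event to every subscriber registered for its type."""
    validate(event)
    for deliver in SUBSCRIPTIONS.get(event["eventType"], []):
        deliver(event)

route({
    "eventType": "mandate.created",
    "eventId": "42",
    "timestamp": "2021-03-27T10:00:00Z",
    "source": "salesforce",
    "payload": {"mandateId": "M-2021-0042"},
})
```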
At the same time, or in addition, we also queue all the events in an event queue, and then, using small micro or mini batches, every, let’s say, 50 minutes to an hour, depending on the load, we pick up a new batch of events and we process the data using Databricks and Delta Lake technology, in order to store the data in what we call the canonical data model, as well as in the views, which will eventually also feed into the data marts, as we see on the right, to feed the various consumption use cases that are at play. So, the Open EFSA portal, but also the other use cases like the developer portal or the EFSA management reporting. So, that’s, in a nutshell, the overall machinery.
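To illustrate the mini-batch idea, here is a hedged PySpark sketch that picks up a batch of queued events from the raw zone and upserts them into a canonical Delta table. The paths, schema and key column are assumptions for illustration, not the actual EFSA data model.

```python
# Minimal PySpark sketch of the mini-batch pattern: pick up queued events from the
# raw zone and upsert them into a canonical Delta table. Paths, schema and key
# column are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # on Databricks a session already exists

RAW_PATH = "/mnt/datalake/raw/salesforce/mandate_events/2021-05-28/"  # illustrative
CANONICAL_PATH = "/mnt/datalake/canonical/mandate/"                   # illustrative

# 1. Read the new mini-batch of events queued since the previous run.
events = spark.read.json(RAW_PATH)

# 2. Keep only the latest event per mandate within the batch.
latest_per_key = Window.partitionBy("mandateId").orderBy(F.col("eventTimestamp").desc())
latest = (
    events.withColumn("rn", F.row_number().over(latest_per_key))
          .filter("rn = 1")
          .drop("rn")
)

# 3. Upsert the batch into the canonical Delta table (create it on the first run).
if DeltaTable.isDeltaTable(spark, CANONICAL_PATH):
    (DeltaTable.forPath(spark, CANONICAL_PATH).alias("t")
        .merge(latest.alias("s"), "t.mandateId = s.mandateId")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    latest.write.format("delta").save(CANONICAL_PATH)
```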
Now, one final part that I want to emphasize is how you organize yourself. So, as you can see at the bottom, we have a series of zones, which collect data at different stages of maturity. Databricks typically refers to these as bronze, silver, and gold. The analogy that we came up with: the first zone is the raw zone, which stores the data in the native format, as we get it from the source system. The second zone is what we call the curated zone, which stores the data in a clean, validated, standardized format. And for curated data we can say that we have high confidence in it. These first two zones are organized similarly: they’re organized around the systems that give you the data, the source systems, while the final two zones are organized around the information.
So, the third zone is what we call the canonical data model, or the reference data model if you want, which is a shelf of reusable pieces of information. Think of it as a data warehouse, in which you have pieces of data that you can reuse to analyze and come up with new views, which is the final layer. The views are what we call data that is ready to be consumed by other systems or applications, so it is optimized for consumption. And we adhere to the principle of storing the views, typically, at least once in the data lake, using Delta Lake if it’s structured information, but also keeping the data in sync with one or more data marts, depending on the use case at hand. So for example, the Open EFSA portal is powered by an Azure SQL database, which stores the same views as we have in the data lake.
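A schematic PySpark sketch of this zone layout, and of publishing a view both to the lake and to a data mart, could look as follows; all paths, table names and connection settings are placeholders for illustration.

```python
# Schematic sketch of the zone layout and of publishing a view both to the lake
# (as Delta) and to a data mart (Azure SQL over JDBC). All paths, names and
# connection settings are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ZONES = {
    "raw":       "/mnt/datalake/raw/",        # data as received, native format, per source system
    "curated":   "/mnt/datalake/curated/",    # cleaned, validated, standardized, per source system
    "canonical": "/mnt/datalake/canonical/",  # reusable reference model, organized per subject
    "views":     "/mnt/datalake/views/",      # consumption-ready data sets
}

# Build a consumption view from canonical tables (simplified example).
mandates = spark.read.format("delta").load(ZONES["canonical"] + "mandate/")
outputs = spark.read.format("delta").load(ZONES["canonical"] + "output/")
view_df = mandates.join(outputs, "mandateId", "left")

# 1. Persist the view in the lake as Delta.
view_df.write.format("delta").mode("overwrite").save(ZONES["views"] + "vw_mandate_outputs/")

# 2. Keep the same view in sync with the Azure SQL data mart that powers the portal.
(view_df.write.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.vw_mandate_outputs")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("overwrite")
    .save())
```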
I think, overall, this is one architecture that is capable of dealing with both big and small data volumes, and with both structured and unstructured data, which I think is a very powerful way of working with data. And that summarizes my part. So I’m happy now to give the floor to my colleague, David, who will take up the second part of this presentation.

David De Wilde: Okay. Thank you, Sebastiaan. So, for the second part, I will go a little bit deeper into the technical implementation. To start, I will go through the technical challenges. Our first challenge was having a lot of tables: we have 350 tables, which is rather complicated to keep organized. On the other hand, we still want to have quick reload times. And of course, cost efficiency is always an important factor. Next to that, it’s important to keep a consistent way of working, even though we have a changing team, with people starting with different experience levels. You want to keep everything organized, so central logging of your data pipelines, data governance and data cataloging are important. And the housekeeping of 350 tables can get really time consuming, so we also want to put a focus on that.
All of us are working on putting structure in a data lakehouse, and we don’t want to end up with a data swamp, so we want to keep everything structured. To achieve that, it’s also important to get your data pipelines organized and structured, and to achieve that goal, we came up with a framework. So it’s important to think about how we structure everything. The main assumption of our framework is that you want to split the business logic, which is the valuable part, as much as possible from all the other activities. So, we want to get all the non-value-adding activities out of the core, which is the business logic, and really work with the dataframe as the central concept. You will see that in some of the other sessions they start from a data pipeline; both approaches are okay, but we start from a dataframe. The non-value-adding activities can be logging, housekeeping, optimization of your tables and everything else.
How did we implement this? We started from an object-oriented framework. Our framework consists of two class definitions. On the one hand, we have the DataLakeTable, which extends the dataframe. So, a DataLakeTable consists of a dataframe, together with some configuration properties and [inaudible] information. Next to that, we have a second class, the DataLake class, which groups a set of DataLakeTable objects and also contains the Directed Acyclic Graph, so the lineage between the data lake tables, and the orchestration functionality. So here, we split the class definitions, which are the framework, from the actual content, which are the DataLakeTables. Developers only need to implement the dataframe and the extra configuration properties, and the class definitions can be copied between projects and between clients.
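A heavily simplified sketch of this two-class idea is shown below. The class names mirror the talk, but the properties and methods are illustrative, and composition is used instead of subclassing the dataframe for brevity; this is not the actual delaware framework.

```python
# Heavily simplified sketch of the two-class framework described above.
# The real implementation is richer; properties and methods here are illustrative.
from dataclasses import dataclass, field
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

@dataclass
class DataLakeTable:
    """A dataframe plus the configuration the framework needs around it."""
    name: str
    dataframe: DataFrame
    path: str                                          # Delta location in the lake
    depends_on: list = field(default_factory=list)     # lineage: upstream table names
    exports: list = field(default_factory=list)        # e.g. [{"type": "sql", "target": "dbo.x"}]

    def save(self) -> None:
        """Write the dataframe to its Delta location."""
        self.dataframe.write.format("delta").mode("overwrite").save(self.path)

class DataLake:
    """Groups DataLakeTable objects and knows their dependency graph."""
    def __init__(self, tables: list):
        self.tables = {t.name: t for t in tables}

    def dag(self) -> dict:
        """Return the lineage as {table name: [upstream table names]}."""
        return {name: t.depends_on for name, t in self.tables.items()}

    def run(self, names=None) -> None:
        """(Re)load the selected tables; here this simply saves them in order."""
        for name in names or self.tables:
            self.tables[name].save()
```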
If you then have a look at how we work with this in practice, you can see that we work in the online environment of Databricks. We start with some parameters at the top of the notebook. Next to that, we have some transformations, as you can see in the second cell. Here, it’s very simple because it’s just an example, but in real life you have more transformations, of course. And then the whole framework is metadata driven. You can see that we have some metadata regarding the orchestration, some regarding how we want to export the data, and some regarding how to catalog the data. If we now focus a little bit more on the non-value-adding activities: these are things that you need to do, but they don’t add direct value. An important part is your CI/CD pipelines. You can see that the build phase is rather simple: we just loop over our notebooks, we create DataLakeTables from them, and then we group the DataLakeTables in a single DataLake object.
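As an illustration of the notebook layout described above (parameters at the top, the transformation in the middle, and a metadata block driving orchestration, export and cataloging), a hedged sketch could look as follows; the metadata keys and paths are assumptions, not the exact properties used in the real project.

```python
# Illustrative layout of a metadata-driven notebook (one notebook per table).
# The metadata keys are examples of the kind of configuration a framework could
# read; they are not the exact properties used in the real project.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# --- Cell 1: parameters --------------------------------------------------------
TABLE_NAME = "curated_mandate"
SOURCE_PATH = "/mnt/datalake/raw/salesforce/mandate/"  # placeholder path

# --- Cell 2: transformation (the only real business logic in the notebook) ------
df = (
    spark.read.json(SOURCE_PATH)
    .select("mandateId", "title", "receivedDate")
    .withColumn("receivedDate", F.to_date("receivedDate"))
)

# --- Cell 3: metadata read by the framework --------------------------------------
metadata = {
    "orchestration": {"depends_on": ["raw_salesforce_mandate"]},
    "export": [{"type": "sql", "target": "dbo.curated_mandate"}],
    "catalog": {"name": TABLE_NAME, "description": "Mandates received by EFSA, one row per mandate."},
}
```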
We can then just save the DataLake object to disk by calling the datalake.save method. And the deployment phase is even simpler, because we just copy the file to our next environment. With the datalake.load method, we load the data lake, and then we can run datalake.initialize to create all Delta tables that do not yet exist in that environment; the existing ones we can update, which will add columns if needed, or alter the properties. Loading data is also just calling datalake.run, to load new data into the data lake, and with datalake.exports we can export data to different systems. The housekeeping is also rather simple: we can just use datalake.optimize to optimize all Delta tables, taking the [inaudible] columns into account. And we have helper notebooks to visualize the DAG representation, or to, for example, export the data catalog.
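To give a feel for what such a single housekeeping call can hide, here is a minimal sketch of the kind of loop a method like datalake.optimize might run, using the OPTIMIZE/ZORDER and VACUUM commands available for Delta tables on Databricks; the table paths, Z-order columns and retention period are placeholders.

```python
# Minimal sketch of what a housekeeping call such as datalake.optimize could boil
# down to: looping over the registered Delta tables and running OPTIMIZE/VACUUM.
# Table paths, Z-order columns and the retention period are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In the real framework this list would come from the DataLake object's metadata.
DELTA_TABLES = [
    {"path": "/mnt/datalake/canonical/mandate/", "zorder": ["mandateId"]},
    {"path": "/mnt/datalake/views/vw_mandate_outputs/", "zorder": ["mandateId"]},
]

def optimize_all(tables, retention_hours: int = 168) -> None:
    """Compact small files, Z-order on the configured columns and clean up old files."""
    for t in tables:
        zorder_cols = ", ".join(t["zorder"])
        spark.sql(f"OPTIMIZE delta.`{t['path']}` ZORDER BY ({zorder_cols})")
        spark.sql(f"VACUUM delta.`{t['path']}` RETAIN {retention_hours} HOURS")

optimize_all(DELTA_TABLES)
```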
Then, to really orchestrate the process, we make use of the Directed Acyclic Graph principle, as many of you probably know from Airflow. For the EFSA project this is really key, so I will go a little bit deeper into it. We have 350 tables, but we still want quick reload times, so it’s not efficient to reload all 350 tables every time, especially since we only expect a small number of tables to get new events per batch.
So what we do is, we first look at which source tables got new events; in this case, it’s tables 2 and 4. Then we only select the descendants of those tables, and those are the tables that need to be reloaded. So for example, nothing has changed on table 1, so we don’t need to reload table 5. We can skip a lot of tables, and this means that we only need to load a limited number of tables. If we go back to the example from Sebastiaan, where a new mandate got created, in that case we only need to load 40 tables instead of the full list of 350 tables. Of course, this means we can run everything much more efficiently, as you can see here.
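A minimal Python sketch of this descendant selection is shown below; the table names and edges are invented to match the example, and the real framework derives this lineage from the DataLake object’s DAG.

```python
# Minimal sketch of selecting only the descendants of the tables that received
# new events, so that just that subtree of the DAG is reloaded.
# Edges (upstream -> downstream) and table names are illustrative.
from collections import deque

# Downstream dependencies: table -> tables that read from it.
DOWNSTREAM = {
    "table_1": ["table_5"],
    "table_2": ["table_6"],
    "table_3": [],
    "table_4": ["table_7"],
    "table_5": [],
    "table_6": ["table_8"],
    "table_7": ["table_8"],
    "table_8": [],
}

def tables_to_reload(changed: set) -> set:
    """Return the changed tables plus every table downstream of them."""
    to_visit, selected = deque(changed), set(changed)
    while to_visit:
        current = to_visit.popleft()
        for child in DOWNSTREAM.get(current, []):
            if child not in selected:
                selected.add(child)
                to_visit.append(child)
    return selected

# Only tables 2 and 4 got new events in this batch; table_5 is skipped because
# its only upstream table (table_1) did not change.
print(sorted(tables_to_reload({"table_2", "table_4"})))
# -> ['table_2', 'table_4', 'table_6', 'table_7', 'table_8']
```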
Then I want to focus a little bit on how we export data, mainly to show you, as an example, how we implemented our framework. So, we export data to different file formats and different databases. We can export to a SQL database and to the Spark file formats, but since we are working in Python, we can also use Python libraries; for example, we created a custom function to export to PDF files via WeasyPrint. We can also export data over REST API calls. Everything is implemented in the framework, in the DataLakeTable class, so we only need to implement it once, and we can copy it between [inaudible] and clients. Then we can just use it by adding a few metadata properties. On the bottom right, you can see that it’s really just a few lines of metadata to export this dataframe both to a PDF file and to an API call.
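Here is a hedged sketch of how such metadata-driven exports might dispatch on export type; the WeasyPrint and requests usage is standard for those libraries, but the metadata keys, the endpoint and the HTML rendering are assumptions rather than the framework’s actual code.

```python
# Sketch of metadata-driven exports: the same dataframe is pushed to a PDF file
# (via WeasyPrint) and to a REST endpoint, driven purely by metadata entries.
# Metadata keys, the endpoint URL and the HTML template are illustrative assumptions.
import requests
from weasyprint import HTML
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

def export_pdf(df: DataFrame, target_path: str) -> None:
    """Render the dataframe as a simple HTML table and print it to PDF."""
    header = "".join(f"<th>{c}</th>" for c in df.columns)
    rows = "".join(
        "<tr>" + "".join(f"<td>{value}</td>" for value in row) + "</tr>"
        for row in df.collect()
    )
    HTML(string=f"<table><tr>{header}</tr>{rows}</table>").write_pdf(target_path)

def export_api(df: DataFrame, url: str) -> None:
    """POST every row of the dataframe as JSON to a (hypothetical) REST endpoint."""
    for row_json in df.toJSON().collect():
        requests.post(url, data=row_json, headers={"Content-Type": "application/json"}, timeout=30)

EXPORTERS = {"pdf": export_pdf, "api": export_api}

def run_exports(df: DataFrame, export_metadata: list) -> None:
    """Dispatch each configured export to the matching implementation."""
    for export in export_metadata:
        EXPORTERS[export["type"]](df, export["target"])

df = spark.createDataFrame([("M-2021-0042", "adopted")], ["mandateId", "status"])
run_exports(df, [
    {"type": "pdf", "target": "/dbfs/tmp/mandates.pdf"},
    {"type": "api", "target": "https://example.org/api/mandates"},  # placeholder URL
])
```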
Then, how did we implement this, and why is it so important? It’s important that it’s implemented in the class definitions, so that you have one version of the code and only one place to maintain, and we reuse it everywhere. This makes it easier to do code improvements. For example, we recently changed from a normal JDBC export to a bulk JDBC export; we only needed to change this code in one place, and we didn’t need to update 350 data pipelines.
Similarly, we switched from pyODBC for executing stored procedures on SQL to the Spark built-in JDBC driver, which is a very interesting case. A colleague of mine has written a nice blog post about this; you can find the link at the bottom of this slide. And of course, since we only have one version of the code, it’s also easier to add extra functionality. Recently, we also added table switching when we are overwriting a table in SQL. So, those are things that are easy to do and improve, because you can make changes in one place and everything gets updated automatically. And how we actually implemented this, as you can see on the right: we have a lot of templated code, PySpark code and SQL commands, and in the right places we just fill in the metadata provided by the developer.
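As an illustration of this templating idea, here is a small sketch in which a generic JDBC export is filled in from developer-supplied metadata; the connection details are placeholders, and the connector format shown is a common choice for Azure SQL on Databricks, not necessarily the one used in the project.

```python
# Sketch of templated code: one generic JDBC export whose options are filled in
# from developer-supplied metadata. Connection details are placeholders, and the
# "com.microsoft.sqlserver.jdbc.spark" format assumes the Spark connector for
# Azure SQL is installed on the cluster (a plain "jdbc" format also works).
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

def export_to_sql(df: DataFrame, metadata: dict) -> None:
    """Generic, templated SQL export: developers only provide the metadata."""
    (df.write.format(metadata.get("format", "jdbc"))
        .option("url", JDBC_URL)
        .option("dbtable", metadata["target_table"])
        .option("user", "<user>")
        .option("password", "<password>")
        .mode(metadata.get("mode", "overwrite"))
        .save())

# Developer-side usage: just a few metadata properties per table.
df = spark.createDataFrame([("M-2021-0042", "adopted")], ["mandateId", "status"])
export_to_sql(df, {"target_table": "dbo.vw_mandates", "mode": "overwrite"})
```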
So, if you then have a second look at our technical challenges and how we solved everything: because of the DAG-based loading of the subtree, we can efficiently and quickly reload the 350 tables. Everything runs on a single job cluster for cost efficiency reasons. And because of our framework, we put structure in our data pipelines and only a bare minimum of coding is needed, since only the ETL coding is needed; this makes it easier for new people to start, and we also enforce a consistent way of working. Both the logging of the data pipelines and the data governance are dealt with by the framework. And doing the housekeeping boils down to just running a single function call, instead of writing a whole lot of pipelines.
So, we can conclude that in this project, we have enriched the event-driven architecture with a central data store, to add more functionality for the customer. Next to that, it’s very important that you don’t lose control over your data pipelines, and you can do this by organizing your data flows in a structured tree. And lastly, it’s important to be lazy and reuse as much of your code as possible. We did this by splitting our logic as much as possible from our housekeeping activities. So finally, I want to thank EFSA for putting their trust in us at Delaware, to help them realize this innovative digital transformation project. And I want to stress that both Sebastiaan and I are present in the chat, and that we will do our best to answer any questions that you have. We also want to encourage you to start a discussion. Thank you for your attention, and have fun with the other sessions.

Sebastiaan Leysen

Sebastiaan Leysen graduated as an engineer from the TU Delft in 2015. He has a BSc degree in Computer Science and a MSc degree in Management of Technology. Sebastiaan started working at delaware i...

David De Wilde

David has a background in mechanical and biomedical engineering. However, he got bitten by the ‘data’ bug and switched over to the data analytics side. His experience ranges from data warehouse ti...