Amir Kermany is a Health and Life Sciences Solutions Architect at Databricks, where he leverages his expertise in genomics and machine learning to help companies in the space to solve their problems in generating actionable insights from vast amounts of health related datasets.

Amir’s past positions include Sr. Staff Scientist at AncestryDNA, Sr. Data Scientist at Shopify, Postdoctoral Scholar at the Howard Hughes Medical Institute and the University of Montreal. He holds a PhD in Mathematical Biology, MSc in Electrical Engineering and BSc. in Physics.

SEIR model is a widely used model for simulating the spread of infectious diseases. In its simplest form, the SEIR model assumes that individuals in the population can assume any of the four states: Susceptible, Exposed, Infected and Recovered (or Removed), and the evolution of the system is modeled as a system of ordinary differential equations. Although this simple model performs well in modeling large dense populations, it does not capture population substructure and the effect of variation in interactions.

To address these issues, the general SEIR model models the population as a network where nodes are individuals and edges represent interactions between individuals. This model has attracted more attention during the Covid19 pandemic and there are python implementations that run the simulation on a single node.

In this talk, we discuss implementing the generalized SEIR model using Spark and graph analysis libraries such as GraphFrames and use stochastic simulation methods to predict the spread of Covid19 using Databricks.

[saisna20-sessions] [transcript-play-code]- Thank you all for joining today. My name is Amir Kermany I'm a Solutions Architect here at Databricks and I'm focusing on our Health and Life Sciences Practice. Today I will be talking about SEIR model, which is a framework to model the spread of infectious diseases in a closed population. And I will show how we can use mlflow and spark, within a Python environment to efficiently simulate the spread of COVID-19.

So, I really like this saying that, "All models are wrong, but some are useful." So, what are these models used for and in particular, in the case of Modeling Infectious Diseases.

The main use is for estimation, we would like to estimate fundamental parameters associated with the infectious diseases, such as the rate of spread. Everyone is to project the future behavior of the disease and in terms of the population being infected, how many people are going to be infected or in terms of fate the fatality rates for example, our projection of that, and also based on these, we would like to inform policy. So, by now, I change behavior, by now we have heard of flattening the curve which is directly rooted in this SEIR model So, SEIR model actually, it is rooted in the simpler model which is SIR, and this model assumes that, the individuals in the population are divided into three stetes, they can take any of the three states have been susceptible, or infected or recover from the disease or die. The first example of such model actually, was published in a series of papers in the late 20s and early 30s, which is more or less when you think about it is 10 years after the Spanish Flu Pandemic.

This SEIR model adds an extra state of being exposed. And this model actually is used for diseases that they have, there is an incubation period, meaning that there is a time in between the exposure and the time that infection happens, and that is perhaps valid for COVID-19.

So, in this model we have, individual (mumble) state being susceptible. it is a transitions to state of being exposed at the rate beta, which beta is their rate of transmission per contact for that individual. And then at a rate sigma, in exposed individual is transmission to be infected, which the rate of transmission to infection to being infected is the inverse of the incubation period of this disease. And then at a rate gamma, individuals who are infected, recover. And the recovery rate is the inverse of infectious period for COVID-19, W=we know that it's like around 14 days. Or individuals at a rate equal the fatality rate of the disease, they can be removed from this.

Also, it is possible that, recovered individuals actually, they don't maintain their immunity and they can go back to be susceptible. In this case we have another rate for going back to the state susceptible. Initially for COVID-19, there were Initial reports about the correlation study that we found that individuals who have been infected, they have been (mumble) infected again however, it turned out that this was and mainly due to the false positive rates. So, we can assume that individuals who recover they maintain their immunity for a while, and this is the assumption that we will be working on throughout this presentation.

So, based on these assumptions, we can write down a series of differential equations, so this is basically a dynamical system describing the evolution of the system in terms of how many people are at each state how many people are here susceptible infected or recover.

The assumptions for this, which is when we are writing it as system of ordinary differential equations. So, we are ignoring it's the chasticity and chance events in the system. To be able to do this we are assuming the population size is larger, it is constant. And for individuals being in any of these states is independent on the characteristics of individual such as age, or, comorbidities and other factors. Moreover, we assume that the spread of the disease is homogeneous within the population and there is no copulation substructure such as network effects in place.

Also, we can add the effects of testing, to this model meaning that you can ask for example, individuals who are exposed they can transmission to be detected exposed or detected infectious and that depends on the rate at which in a population we are testing the samples.

So, as I said, this is a dynamical system. You can use Python easily for the deterministic model. You can use Python to solve this system of equation numerically. so, there is no closed form solution for this so we always use a numerical methods to solve these, we are going to use a package implemented in Python called surplus it is on gitHub. This is a package that actually has both a deterministic framework and also stochastic framework that I will talk later about, which incorporates the impact of network effects for example, and allows you to deviate from these assumptions that we have for the deterministic now. Related to use show that how we can begin the database platform we can use mlflow and hyperopt for informing policy for example, we want to see what's the impact of closing down the economy or social distancing and see what each of these parameters have an impact on lets say fatality rate. And also I will talk about how we can use that spark to massively gonna simulate in parallel many stochastic paths and look at the average number of infections.

So, this is an example of writing running surplus model. So, easily you can initiate and instantiate a class of this stochastic model, you initialize the parameters, the default parameters are actually based on what we know about COVID-19 in terms of transmission rate and rate of reinfection rate of recovering. So, in this example, as you see in the curve on the left, so, if we are running the simulation, the simulation for a period of like 300 days, we see that up to like 16% of the population can get infected.

However, within this package, you can specify checkpoints, where that at a given time T zero you impose, let's say social distancing. And agree that you're showing social distancing for that period of time, you have already used beta, which is rate of transmission. And then look at the impact of that. And as you've seen the curve, so now that you have the social distancing in place for a period of, let's say, 50 days, you significantly, make the number of infections to come down. And this is an example of what we're talking about flattering the care.

So let's go see how we can use this model to perform the first task that they mentioned, which is to estimate the parameters of the system. So using the deterministic model, they can actually make approximations to get a closed form equation for the spread of disease. And then based on that we can use for example, classical tools such as curve_fit in scipy to do curve fitting and estimate those models. Another approach would be to use a heuristic approach, which I will show how we can do that here. So let's say that you want to estimate beta and UI.

First, let's just run the model. So the simulation based on the default parameters that we know from COVID-19. And in this case, I'm running it for the period comparable to the period that we have data for US population. So on the left plot, what you're seeing is the data that is included in data bricks, data sets under COVID. And which is based on Johns Hopkins Hospital data set. And this is showing the number of infections and deaths in the US, for this period of time. So if I run the simulation with the default parameters within the same period of time, this is what you see. So what you're seeing in blue here, on the right graph is the number of confirmed infections, and in orange, what you're seeing it is something that is predicted from the model. And this actually tracing really well, keeping in mind that the number of confirmed cases is always less than the number of actual infections, because confirmed cases are the ones that have been detected. So now if I want to estimate these values, one, thing I can do is, actually write a Python function that runs the simulation. And then after running the simulation, I get the number of infections or fatalities that they want, In this case, I'm using the number of fatalities based on the simulation. And then also I read the data from the actual US population. And what I would do I define a distance is the Euclidean distance between the path between the simulated data and the actual data and we find the loss function based on that and return the but the output of the function is the loss function. Which is the distance between the predicted number of deaths and the actual number. Next, I can use hyperopt, which is this package in Python that used a Bayesian approach for searching a grid of hyper parameters to minimize a cost function. So if I run that, in this case, I want to estimate beta and DUI. And as you see after running it in parallel with (mumble), I can get a very good approximation, based on the model in orange what you see is the predicted number of deaths based on what we get from the model. And in blue is the actual number of deaths that you get from the data. So this is again, this is a very heuristic approach of searching the grid of parameters for finding the best fit to them all.

Another example of using this map technique with hyperopt and with manage mlflow, let's say I want to see what's the impact of the (mumble) of an intervention? For example, I want to know, how early should we close down the, you know, a population like people social distancing for how long it should be, and what would possibly, or it should be in the terms of how much we want to reduce the transmission rate. And this is something again, I can specify as some hyper parameters. And then in my cost function that I'm defining based on this Python functionally defined, the cost function can be the total number of deaths, plus a cost associated with closing down the economy, like for example, the length of time that you're going to close this, and then next we can run it again what I'm running here is that I'm using mlflow.

I'm using hyperopt between data bricks environment with spot trials. But also I use mlflow to track all this different rounds of the simulation. Then if I look at the MLflow dashboard, I can look at, the impact of each of these parameters. For example, on the top left, you can see scatterplot of, that shows the impact of the length of time that you're closing down the economy. And on the X-axis, you have fatalities. On the X-axis, you have the length of time and the Y-axis you have fatalities. And as you see, the longer the length of time, the lower the fatality.

In the plot below actually, this is really better showing the impact of each of these parameters, the three parameters that went into this closing down exercise, if we now look at this, we see that the lowest cost which has been this fatality, is corresponds to closing down the economy earlier, the society earlier which is t (indistinct). As early as possible and as long as possible and also as severe as possible like with using the transmission rate, this is obviously a sanity check on the approach of this is something that is obvious, but you can imagine that in a real case scenario, you can have more complex scenarios that you have more complex cost functions that the impact of each of these parameters is not evident, but still you can use the same technique to describe the impact of each of these on your model.

And these two plots showing the actual curves corresponding to the least impactful strategy which is not closing down for a long time. And as you see a second wave would be severe compared to the one that you close it in a short time, a longer time and more severe and you have a less severe second welcome.

So, I mentioned the assumptions that we have in a stochastic versus deterministic solution. So the first one is population size is large and it's constant. The other one is that susceptibility is for example, independent of age and other things. But what we now know about COVID-19 is that highly like H, for example, is a big determinant in pathology and infection rates. So that is not an assumption that people want to work with. The other thing is that assumption that the rate of spread is independent of population structure. This is something I took from Washington Post, they have a very nice infographic showing the case in South Korea, how one individual who attended a church service actually was responsible for making a lot of other people infected. And this is a very good example of the network effect there some individuals with catalog contacted in an effort are obviously disproportionally contributed to the spread of the disease. So based on that, we can incorporate the impact of networks.

So within this surplus model, you can actually specify the interaction network. And the assumption that it is built on is that an individual in the population who gets exposure from the community as a whole or from an infected individual within the population within its network. So, as you save it probably assume that the probability P, the individual is contracting the disease from the community, which in this case is very similar to the deterministic model. And with the complementary probability of one minus P, the individual can get infected from a person that he or she is directly contacted contacting within it's her network or his network. As similar to the deterministic model, an individual who is exposed now transitions to being infected with a probability this time of sigma. And then similarly, the other transitions are occurring within the population for example, individual who is now infected in the network can get another individual expose and expose individual can go to being affected. And now an individual who is infected transitions to being recovered with the property gamma. So, that is the old, being removed from the system. So, this is actually in surplus package that I'm using. You can specify this network here and it comes with pre-built networks that you can specify two parameters that kind of controls the distribution of degrees of connectedness within the network. And in this case, also you can specify for example, that social distancing is more realistically described can be Specify by imposing a network that the average connectivity of individuals are the average GPA of individuals is a lot lower. So, this example is running the surplus in the stochastic mode, where the initial network distribution of degrees in a network is shown in the histogram that you see on the right top side of the top right graph.

Again social distancing is imposed by reducing the average degree of the network, which is at the bottom histogram that you see. And as you see this, when you impose this kind of social distancing, what you see is a flattening of the rate of infections. And then if you get back to open this, going back to the initial interaction network, you will you will see a peak but it goes down.

So when we are talking about stochastic simulations, as if you're familiar with anything else, stochasticity means that each realization of the simulation will be different than the other one. So most often, when we have stochastic simulations, we want to aggregate the results and over many runs of the simulation, and this is an embarrassingly parallel task. So what we would like to do, we would like to be able to easily distribute multiple simulations, and then look at the average path. So this is a very good use case that we can leverage UDF in spark so you can wrap your simulator to the function that you're simulating this case, it's your classic acoustic simulation, rather than the Python function and then you can use Python UDF, spark UDF to run on large many simulations.

I'm just giving this example that you're running into the given system parameters but the actual use case the value shows itself if you are lets say you want to look at a large number of parameters and run multiple simulations at the same time, that you can then feed those parameters into this UDF and run many simulations at the same time. And you spark aggregation methods, for example, taking average on data frames to find the mean. Now let's just see what would the impact of the network so what I'm interested, I'm going to run the simulation as in this case, I'm running the simplest socastic simulation at 20 times. And then I take the average of the numbers of predictive in the time series in terms of the number of infections that are predicted. But then what I'm going to vary between these different simulations is that I'm changing the structure of my network, so from the left on the left side, I have a network that people are the average degree is not a lot, to the right that I'm increasing the average degree of the distribution of degrees within the network, so people are more connected with each other. As you see in the case that the network is more of a tight network, we have a lot more deviation from what would be predicted from the deterministic in this case. And as I'm increasing the degree of the network I'm sorry, this can we go back one slide?

Okay so this is where I start again.

Yeah, no worries, it's a bit complicated okay. I'm starting now. So what I would like to know is to compare this stochastic predictions with the deterministic prediction for example, in this case, I'm running the simulation 20 times and the only thing I'm wearing varying between different runs is that I am changing the structure of the network. So that under, as you see, in the left, I have a network, which is tight and individuals are not a lot connected to average degrees around 11 to a network that individuals are very much connected. So you have people who are interacting a lot on average. As you see, on the left side, I have the most deviation from the predictions of the deterministic model. And that's expected because a deterministic model ignores the network structure. And as I'm increasing the degrees, I'm getting getting more and more close to what the deterministic model is predicting. In this case, however, I'm assuming a 5050 split between possibility of an individual being contracted due to interacting with the community as a whole compared and also 50% chance that individuals are actually getting constructed only within their network so this would be close friends or family

So, now if I change that balance, so that put more weight on the contraction from the coming from the network, which is that the individual is directly interacting with

the community as a whole. So, the equivalent would be imagine that we are closing down

workspaces people are working from home et cetera. In this case, what you see is that actually you have a lot more deviation even with the network, people are interacting a lot within the network, you have a lot of a lot more deviation from the deterministic model in this case. So, that actually you have a lot less infections. So, this is kind of simulation is showing that, yes, if people are staying within the there network, obviously, the total number of infections would be less. So, the other parameter now I can change my values of the size of the network if I increase that, it's not that that effect will go away still we will see this deviation from the deterministic model even for a larger population. So, this is something certain it's not about cause if you recall assumptions that population is large for deterministic case the population is assumed to be large. And the other thing is that we assume there is a population substructure and what it emphasizes that the deviation that we see from the deterministic case is mostly due to the structure of the network and not the size of the population. And as you see, the only effect that it has it is now it is more smooth to interpret this footage.

So in conclusion that I just wanted to show you that how you can use an implementation of this SEIR within Python two and combine it with using mlflow and hyperopt to run

the parallel simulations of both in the stochastic and deterministic setting, to run power simulations, to use, for example, that looking at the behavior to the impact of different policies, impact of different parameters on the spread of the disease and that's something that actually can make a useful projection of the infections. And also I showed that how we can use actually spark to parallelize the simulation. This is interesting cause usually when you're taking off spark you're mounting me up on a ETL tool that does a lot of work in parallel. But one of the good things about, using spark is that it can easily distribute its sample tasks such as simulations, among many machines, and it's very flexible. If you're looking at cloud environments, you actually can distribute on many machines, you can run a lot of simulations at the same time, and then use aggregations on the output and right away get the result in a really timely manner. And also in terms of running the simulation and experiments that we saw some interesting factors, for example, one of the things that we saw here is they actually imposed it that we are considering the network and substructure and a network structure. In SEIR class, we see significant deviation from what would be predicted from the deterministic model. Obviously, this is an example of one example. There are many other examples that we use, we have agency simulations for infectious diseases, there are other implementations of SEIR, or you can even use your own implementation and easily code in Python. But still, we can grab that and use these techniques in order to run experiments and just run things in a distributed manner.