Kyle Foreman leads the Scientific Computing team at the University of Washington's Institute for Health Metrics and Evaluation, which combines the world's largest global health data repository with cutting-edge statistical techniques. IHME's Global Burden of Disease study has been cited thousands of times and guides funding and policy decisions in countries around the world and at organizations like the Bill & Melinda Gates Foundation. Kyle studied neuroscience at Harvard before earning his PhD in biostatistics from Imperial College London. After a couple of years at health technology startups, he returned to Seattle to pursue his passion for combining global health with data science.
The Institute for Health Metrics and Evaluation is extending the landmark Global Burden of Disease study to forecast what the future of the world's health might look like under a variety of scenarios. To do so, we are creating massively detailed simulations of the world, using Spark to distribute a workload that can produce up to a petabyte of outputs per run. One novel aspect of our forecasting platform is that it captures the complex interdependencies between sociodemographic indicators, risk factors, diseases, and mortality. The strength of this approach is that it lets us capture the emergent effects of small changes in a complex system. However, it also requires us to simulate this massive system as a whole, forgoing previous approaches that rely on simple independent parallelization.

To enable these massive simulations, we have built a Python module, SimBuilder, that constructs directed acyclic graphs from YAML files describing nodes and edges with simple symbolic formulas. The DAG can then be evaluated using different backends: Pandas for small simulations and PySpark for tera- and petabyte-sized runs. We have experimented extensively with various Spark data structures to find a good tradeoff between memory footprint, processing time, and ease of implementation. We have found that Spark DataFrames are the simplest to implement because of their similarity to Pandas, but they are inordinately slow at running our simulations. For high-dimensional problems with balanced data, we can improve processing speed by several orders of magnitude by filling partitioned RDDs with multidimensional NumPy arrays and utilizing their vectorized array operations. I will describe the data structures we have found to work best, present benchmarks of their performance, and share how others can use our SimBuilder software to run similar simulations.
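To give a flavor of the DAG-from-YAML idea, here is a minimal sketch using the Pandas backend. The abstract does not show SimBuilder's actual schema, so the node names, the `parents`/`formula` fields, and the `evaluate_dag` helper below are all hypothetical; the only assumption carried over from the abstract is that nodes and edges are described declaratively with symbolic formulas.

```python
# Hypothetical sketch of a SimBuilder-style DAG: nodes are quantities, and
# formula nodes combine their parents via a symbolic expression. Evaluated
# here with a Pandas backend; the schema and names are illustrative only.
import pandas as pd

# What a parsed YAML spec might look like after loading: each formula node
# names its parent nodes and a symbolic expression over them.
SPEC = {
    "income": {},
    "education": {},
    "mortality": {
        "parents": ["income", "education"],
        "formula": "0.5 * income + 0.3 * education",
    },
}

def evaluate_dag(spec, data):
    """Evaluate each formula node once all of its parents are available."""
    df = data.copy()
    pending = {name: node for name, node in spec.items() if "formula" in node}
    while pending:  # naive topological pass over the DAG
        for name, node in list(pending.items()):
            if all(p in df.columns for p in node["parents"]):
                df[name] = df.eval(node["formula"])  # symbolic formula
                del pending[name]
    return df

df = evaluate_dag(SPEC, pd.DataFrame({"income": [1.0, 2.0],
                                      "education": [2.0, 4.0]}))
print(df["mortality"].tolist())
```

Because the formulas are plain expressions over named columns, the same spec could in principle be evaluated by a different backend (e.g. Spark) without changing the YAML, which is the portability the abstract describes.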
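The array-backed RDD layout can also be sketched without a Spark cluster. The idea from the abstract is that each partition holds one multidimensional NumPy array rather than millions of individual rows, so a simulation step becomes a single vectorized call per partition. The simulation step, axis choices, and array shapes below are hypothetical; in real code the list comprehension at the end would be an `rdd.map` or `rdd.mapPartitions` over partitioned data.

```python
# Sketch of filling partitions with multidimensional NumPy arrays: one
# array block per partition, processed with vectorized operations instead
# of per-row Python logic. Spark is omitted; `partitions` stands in for
# the contents of a partitioned RDD. All names and shapes are illustrative.
import numpy as np

def step_mortality(block):
    """Hypothetical simulation step for one partition's array block.
    Axes (illustrative): location x age x draw."""
    risk, baseline = block
    # One vectorized operation over the whole block, instead of looping
    # over every (location, age, draw) cell as a DataFrame row.
    return baseline * np.exp(0.1 * risk)

rng = np.random.default_rng(0)
partitions = [
    (rng.normal(size=(5, 20, 100)),            # risk exposure block
     rng.uniform(0.01, 0.1, size=(5, 20, 100)))  # baseline mortality block
    for _ in range(4)
]

# With Spark this would be roughly:
#   sc.parallelize(partitions, numSlices=4).map(step_mortality).collect()
results = [step_mortality(block) for block in partitions]
print(results[0].shape)
```

The speedup the abstract reports comes from moving the inner loop out of the JVM/Python row machinery and into NumPy's compiled array kernels; the tradeoff is that this layout assumes balanced, dense, high-dimensional data.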