The goal of the 2020 Census is to count every person in the US, once, and in the correct place. The data created by the census will be used to apportion the US House of Representatives, to draw legislative districts, and to distribute more than $675 billion in federal funds. One of the data challenges of the 2020 Census is making high-quality data available for these purposes while protecting respondent confidentiality. We are doing this with differential privacy, a mathematical approach that allows us to balance the requirements for data accuracy and privacy protection. Our custom-written application uses Spark to perform roughly 2 million optimizations involving mixed integer linear programs, running on a cluster that typically has 4,800 CPU cores and 74 TB of RAM. In this talk, we will present the design of our Spark-based differential privacy application and discuss the application monitoring systems that we built in Amazon’s GovCloud to monitor the multiple clusters and thousands of application runs used to develop the Disclosure Avoidance System for the 2020 Census.
U.S. Census Bureau
Simson Garfinkel is the Senior Computer Scientist for Confidentiality and Data Access at the US Census Bureau. He holds seven US patents and has published more than 50 research articles in computer security and digital forensics. He is a fellow of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE), and a member of the National Association of Science Writers. As a journalist, he has written about science, technology, and technology policy in the popular press since 1983, and has won several national journalism awards.