Providing Public Access to Confidential, Big Social Science Datasets
Abstract: Large-scale databases from the social, behavioral, and economic sciences offer enormous potential benefits to society. When made widely accessible, these databases facilitate advances in research and policy-making, enable students to develop skills in data analysis, and help ordinary citizens learn about their communities. However, as most stewards of social science data are acutely aware, wide-scale dissemination of such data can result in unintended disclosures of data subjects' identities and sensitive attributes, thereby violating promises--and in some instances laws--to protect data subjects' privacy and confidentiality. As the size, richness, and quality of social science data have increased, so too have the threats to confidentiality and the difficulty of making such data widely available. In this talk, I outline a vision for disseminating large-scale social science data. In short, I believe a way forward for access to these data is an integrated system comprising (i) unrestricted access to highly redacted data, most likely some version of synthetic data; (ii) means for approved researchers to access the confidential data via remote access solutions; and (iii) verification servers that tie the two together by allowing users to assess the quality of their inferences from the redacted data, so that they turn to remote access only when necessary and use it more efficiently when they do. Throughout the talk, I highlight advances in each of the three components of the system and indicate open research challenges.