Researchers hoping to use population-based administrative data face significant governance challenges around the release of individual-level data for analysis. This invited session, organised by the Emerging Applications Section at the RSS conference, aimed to discuss these challenges and two potential solutions: anonymisation and synthetic data.
Three speakers gave presentations on the role of anonymisation, and on the methodological and practical aspects of using synthetic data as a way of accessing individual-level administrative data for research: Elaine Mackey (Research Associate, University of Manchester), Beata Nowok (Research Fellow, University of Edinburgh), and Robin Mitra (Lecturer, Lancaster University). Katie Harron (Assistant Professor, LSHTM) chaired the event.
Elaine Mackey began by describing the importance of functional anonymisation, i.e. anonymisation considered in the context of the data environment (the data, the people, and other data that an organisation may have access to). She discussed the need to balance the risk of disclosure against the usability of the data, and the problem that absolute anonymisation may leave data of little use for statistical analysis. She then described the Anonymisation Decision-making Framework as a basis for considering the technical, legal, social and ethical aspects of accessing data within the Administrative Data Research Network (ADRN).
Beata Nowok followed by discussing the potential of synthetic data to widen the use of sensitive data for research. Although using synthetic data in place of real data may be overambitious in real applications, she pointed out the value of synthetic data for preparatory work, for developing code, and for teaching purposes. She discussed the benefits in terms of cost and time savings, and of avoiding the need for repeated visits to data safe havens. She described the R package synthpop, and future plans to improve the efficiency of providing synthetic data to researchers.
Robin Mitra concluded the three talks by describing how synthetic data are generated, by modelling relationships between variables and replacing values in the data with synthetic values that are drawn from the statistical model. He described different applications of synthetic data, and discussed confidentiality in the context of large synthetic datasets that contain a mix of observed and imputed values.
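The general idea of modelling a relationship and replacing values with draws from the fitted model can be illustrated with a minimal sketch. This is not the method presented in the talk, nor the synthpop implementation: the variables (age, income), the toy data and the simple linear model with normal errors are all assumptions chosen for illustration. Age is kept as observed and only income is replaced, giving a dataset that mixes observed and synthetic values.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x, plus residual SD."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return intercept, slope, sigma


def synthesise(x, y, rng):
    """Replace every y with a draw from the fitted model of y given x."""
    intercept, slope, sigma = fit_line(x, y)
    return [rng.gauss(intercept + slope * xi, sigma) for xi in x]


def pearson(x, y):
    """Pearson correlation, used here only to compare real and synthetic data."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Toy "confidential" data (entirely made up for this illustration).
rng = random.Random(0)
age = [rng.uniform(20, 65) for _ in range(500)]
income = [15000 + 600 * a + rng.gauss(0, 5000) for a in age]

# Age stays as observed; income is replaced by synthetic draws.
synthetic_income = synthesise(age, income, rng)

print("correlation, real data:     ", round(pearson(age, income), 3))
print("correlation, synthetic data:", round(pearson(age, synthetic_income), 3))
```

In this sketch no synthetic income equals a real one, yet the age-income relationship is broadly preserved. It is preserved only because the linear model happens to match how the toy data were generated, which echoes the point raised in discussion that synthetic data are only as good as the model behind them.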
Audience members contributed to a lively debate around the different purposes for which synthetic data could be used (including both teaching and research) and the particular challenges in using synthetic data for research (in that the data produced are only as good as the model used to generate them). Speakers gave a clear picture of the value of administrative data for answering questions that require large sample sizes or detailed data on hard-to-reach populations, and for generating evidence with a high level of external validity and applicability for policy making. We also discussed how synthetic data and anonymisation are particularly relevant to recent data sharing initiatives such as the Census Transformation project, and the need for a clearer dialogue with the public about what is meant by anonymisation.