Massive amounts of data on human beings can now be analyzed and used for profit. Entire businesses are based on linking and selling data so that goods and services can be sold, so that political campaigns can be won and so that potential terrorists can be identified. Those massive amounts of data on human beings can also be harnessed to serve the public good: scientists can use 'big data' to do research that improves the lives of human beings, improves government services and reduces taxpayer costs.
In order to fulfill this vision, statisticians’ knowledge of how to do research on big data must move beyond its current infancy. It took decades of work, and many great statisticians, to develop a fully rigorous analytical framework for survey research; many great statisticians will be needed to address the analytical challenges posed by the new types of data served up to us by computer scientists. Doing that work requires access to the data, which means that the privacy challenges associated with access must be addressed. If the institutional authorities that grant approvals do not have a framework within which to operate, their default position will be to deny access, with profound negative consequences for the public good.
Our book seeks to answer a number of key questions. What is the legal and regulatory framework within which data are being collected, used, and shared? What does informed consent mean in the new environment? Do people 'own' the data that are collected on them?
What are researchers’ and institutions’ duties to protect the data held? How do we design effective studies within the context of big data with confidentiality concerns? How can we facilitate verifiability in findings? And what are the practical options? That is, what can researchers tell the institutional authorities so that they have a basis for providing research access to these new types of data?
The first set of chapters probes the legal framework and suggests that big data call into question any legal system that is based on informed consent, data anonymization and privacy collection. Katherine Strandburg opens the first section by noting that the traditional regulatory tools for managing privacy, namely notice and consent, have failed to provide a viable market mechanism. She recommends that we move away from individualized assessments of the costs of privacy violations and that the privacy law governing the collection of private data for monitoring purposes be strengthened.
Alessandro Acquisti points out the difficulty in relying on market mechanisms (like supermarkets paying for data) to solve the privacy challenges: consumers may value their own privacy in variously flawed ways because they have incomplete information, or too much information. Solon Barocas and Helen Nissenbaum are clear that, since it is impossible to guarantee anonymity or even to assess the potential harm associated with data collection, it is impossible to ask human subjects for informed consent. Paul Ohm provides an extremely thoughtful treatise on how to rethink traditional definitions of personally identifiable information, covering such issues as the role of regulation in creating and policing walls between datasets and the importance of continual reminders of the ethics of big data research.
The second set of chapters provides a practical operational framework. The authors spell out the value of big data for the public good and walk through how to address practical access issues from a regulatory, technical and legal perspective. Steven Koonin and Michael Holland focus on the value of big data for a specific use case with which they are familiar, city management. The benefits they identify are reduced taxpayer cost and burden, greater transparency and less corruption, greater economic growth, and the potential to address problems of epidemics, climate change and pollution.
Robert Goerge provides very practical guidance gleaned from decades of building a child-centred data warehouse for Chicago and the state of Illinois. He identifies five key themes: (i) identifying the data to develop and access; (ii) developing capacity in state and local agencies; (iii) presenting data in a way that is accessible and useful for agencies; (iv) keeping data secure; and (v) building trust. Peter Elias describes the practical experiences in Europe, and the focus on developing a harmonized approach to legislation designed to provide individuals and organizations with what has become known as the ‘right to privacy’.
The next group of chapters in this section offers a set of very practical suggestions. Daniel Greenwood and Alex Pentland et al. argue for a 'New Deal on Data'. If, they say, data is the new oil of the economy, we need to build interstate highways to facilitate the movement of the data. They take this to mean the establishment of business, legal and technical standards, and, based on their experience at the MIT Media Lab, recommend building an open Personal Data Store (openPDS) personal cloud trust network and Living Informed Consent, where the user is entitled to know what data is being collected and by what entities, and is put in charge of sharing authorizations.
Carl Landwehr recommends technical solutions. Access must be allowed through controls engineered into the data infrastructure and through analysis on encrypted files. We should build systems in which information flow, rather than access control, is used to enforce policies. He also identifies a framework that can be used to delineate the characteristics of subjects, objects and access modes.
John Wilbanks proposes a portable legal consent framework. He argues that traditional frameworks to permit data reuse have been left behind by the mix of advanced techniques for re-identification and cheap technologies for the creation of data about individuals. He advocates a commons-based approach that can be used to recruit individuals who understand the risks and benefits of data analysis and use.
The final set of chapters deals with the statistical framework. A major theme, not surprisingly, is the importance of valid inference, and the role of statisticians’ access to data in building an understanding of how to do such inference. Another theme is the inadequacy of current statistical disclosure limitation approaches, which are based on a survey framework.
The final chapter calls for an entirely new analytical framework. Frauke Kreuter and Roger Peng set the stage by noting that the core problem with big data is that statisticians need to build an understanding of the data-generating and data collection process. That is why access is critical: with it, coverage and quality issues can be identified and addressed. They also note that traditional analytical approaches are likely to be unsatisfactory, since an important feature of big data is the ability to examine different, targeted populations, which often have unique and easily re-identifiable characteristics.
Alan Karr and Jerome Reiter follow with a discussion of the challenges of applying traditional statistical disclosure control techniques to big data. They note that the core problem is that the assessment of risk is based on assumptions about what a possible intruder knows about the released data, whether the intruder is malicious and how easily someone can be identified. The first two are unknown in big data; the last is distressingly easy. They suggest that a viable way forward for big data access is an integrated system including (i) unrestricted access to highly redacted data, combined with (ii) remote access solutions, as well as (iii) verification servers that allow users to assess the quality of their inferences.
The concluding chapter, by Cynthia Dwork, comes last precisely because she identifies an important new research agenda for privacy and confidentiality. She argues that big data mandates a mathematically rigorous theory of privacy, one that makes it possible to measure and minimize cumulative privacy loss as data are analyzed, re-analyzed, shared, and linked.
Any definition of privacy must include a measure of privacy loss, and so data usage and research with micro data should be accompanied by publication of the amount of privacy loss incurred, that is, its privacy ‘price’. Her treatise on differential privacy provides just such a framework.
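To make the idea of a published privacy ‘price’ concrete, here is a minimal sketch, not taken from the book, of one standard differential privacy technique: the Laplace mechanism for counting queries, paired with a simple accountant that adds up the privacy loss (epsilon) of each release. The names `PrivacyAccountant` and `noisy_count`, the example data and the budget of 1.0 are all hypothetical and chosen only for illustration.

```python
import random


class PrivacyAccountant:
    """Tracks cumulative privacy loss (epsilon) across released statistics."""

    def __init__(self, budget):
        self.budget = budget  # total epsilon the data custodian allows (illustrative)
        self.spent = 0.0      # cumulative privacy loss so far

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        # Basic sequential composition: the losses of successive releases add up.
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, accountant, rng):
    """Release a count with Laplace noise; a count has sensitivity 1."""
    accountant.charge(epsilon)  # pay the privacy 'price' before releasing
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical usage: one query costing epsilon = 0.5 out of a budget of 1.0.
rng = random.Random(0)
accountant = PrivacyAccountant(budget=1.0)
ages = [23, 35, 41, 58, 62, 67, 71]
released = noisy_count(ages, lambda a: a >= 60, 0.5, accountant, rng)
print("released count:", released)
print("cumulative privacy loss:", accountant.spent)
```

The `spent` value is the kind of figure that could be reported alongside a result. Simple addition of epsilons is the most basic composition rule; tighter accounting methods exist and are not shown here.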
In sum, the book answers some pressing questions around data privacy and access. The book’s authors paint an intellectual landscape that includes legal, economic and statistical frameworks. The authors also identify new practical approaches that simultaneously maximize the utility of data access while minimizing information risk.
'Privacy, Big Data, and the Public Good: Frameworks for Engagement' is published by Cambridge University Press on June 12, 2014.