But it was not this upbeat assessment that got health data into the news; it was a whiff of scandal - principally the sale of excerpts from individuals' health records to commercial companies without their knowledge[1]. Doubtless Edward Snowden’s revelations of snooping contributed to this anxiety about personal information.
The impact of this has been rapid and far-reaching. The body responsible, the Health & Social Care Information Centre (HSCIC), froze all data sharing, conducted an inquiry into what happened and promised to mend its ways. It asked the 596 recipients with elapsed contracts to delete the data, only to find that some copies had left UK and EU jurisdictions and that at least one customer had no intention of complying.
As long as health records in the NHS have been electronic, there has been great caution to ensure they are secure. Stories are sometimes told of a time when doctors would pile patients’ paper records into their car to work on at home, before accidentally leaving them in strange places, but those days are long gone. Access to patients’ data, whether for the purposes of retrospective research, or auditing the quality of the care they received, is now controlled by government agencies and health care providers. However, the duty of confidentiality has to be balanced against improving the quality of care and discovering new treatments and risk factors for illnesses.
Now the Law Commission, responsible for recommending changes to the law in England, has published a report on data sharing. As Marion Oswald recently wrote in these pages, the report calls for a full law reform project 'to create a principled and clear legal structure for data sharing'. Initially the remit covered data sharing among public sector bodies, but the Commission soon realised that public sector functions are increasingly outsourced to private companies, and that, especially in healthcare, it is not clear who owns the data and what they can do with it.
Recently, the government in England updated the 'NHS Constitution' to reflect movement towards more private and voluntary sector provision of care. This gave prominence to protecting data but said very little about using it for research and audit. It remains unknown whether private or voluntary sector organisations will own their patients’ data, whether they can withhold it from other care providers or researchers, and whether they can sell it. To my surprise, nobody seems interested in finding out. A legal concept of data ownership would certainly help here.
The anonymisation problem
Part of the problem is uncertainty about the risk of identification that goes with various forms of data. Even without your name or any unique ID numbers, you might still be identified from data, whether deliberately or accidentally. One of the first jobs I ever had handling data involved the prescriptions received by older people in care homes[2]. I puzzled over a questionable outlier in the age data from a certain nursing home, and later found out from a newspaper exactly who it was, her name and those of close relatives: she was the oldest person in the country and had just passed away. I have always remembered that uncomfortable situation of having inadvertently identified someone, quite against my own intentions.
The case of I v Finland is important here: a nurse who was HIV-positive alleged that she had been identified, and had lost her job, as a result of her employers inappropriately accessing her medical records. On appeal all the way to the European Court of Human Rights, her claim that the hospital had not protected her data appropriately was upheld, even though there was no trace of who might have accessed it. Although she was not named in the data, in a small organisation it is easy to work out who someone is.
In the UK, legal precedent requires that the risk of re-identifying an individual be considered as a separate question from whether the data were 'personal' before being anonymised.
In fact, it is often possible to identify someone; the only question is how likely that is. Even with an average desktop computer, millions of possible identities can be checked quickly, given the anonymous data and some matching named data. ‘Pseudonymised’ data[3] that can be linked across different sources brings new research opportunities but increases this risk (note the optimism about data linking when authors in April 2014’s Significance tell us that 'the life journeys of offenders can now be followed' - and the salaries, bank accounts and childcare arrangements of their friends and families too). Also, we have known for decades that providing only aggregated counts in response to queries on a server is no protection at all against an intelligent attack that combines results from multiple queries[4].
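The 'tracker' attack described by Denning and colleagues can be illustrated in a few lines of Python. This is a minimal sketch with invented data: the threshold rule stands in for any server that refuses to answer queries matching too few people.

```python
# Hypothetical micro-database: each row is one (unnamed) patient.
db = [
    {"ward": "A", "age_band": "60-69", "diagnosis": "diabetes"},
    {"ward": "A", "age_band": "70-79", "diagnosis": "hiv"},
    {"ward": "A", "age_band": "60-69", "diagnosis": "asthma"},
    {"ward": "B", "age_band": "70-79", "diagnosis": "asthma"},
    {"ward": "B", "age_band": "60-69", "diagnosis": "diabetes"},
]

THRESHOLD = 2  # the server refuses to answer queries matching fewer rows

def count(predicate):
    """Aggregate-count query that refuses 'small' results."""
    n = sum(1 for row in db if predicate(row))
    if n < THRESHOLD:
        raise ValueError("query refused: result too small")
    return n

# A direct query about the single 70-79 patient on ward A is refused...
# count(lambda r: r["ward"] == "A" and r["age_band"] == "70-79")  # refused

# ...but a 'tracker' recovers the refused count from two permitted queries:
n_ward_a = count(lambda r: r["ward"] == "A")                # 3: allowed
n_ward_a_younger = count(lambda r: r["ward"] == "A"
                         and r["age_band"] != "70-79")      # 2: allowed
print(n_ward_a - n_ward_a_younger)  # 1: the refused count, recovered
```

The same subtraction then works for any sensitive attribute of that patient, which is why suppressing small counts on its own is no protection.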
A lot is known about the risks of re-identification and about techniques to prevent it, even if there is no silver bullet at present. We can add or subtract one at random from counts of individuals in a table ('Barnardisation'); we can pull outliers back to a threshold ('Winsorisation') or replace their precise values with coarse bands; or we can perturb the data in such a way as to preserve pre-specified statistical features[5]. None of these is perfect; each simply reduces the risk of re-identification[6].
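As a rough sketch (invented numbers, not a production disclosure-control tool), the first three of those techniques look like this in Python:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Invented data: patient counts by age band, and ages with one extreme outlier.
table = {"60-69": 4, "70-79": 3, "110-119": 1}
ages = [62, 64, 65, 67, 71, 73, 74, 113]

# Barnardisation: add -1, 0 or +1 at random to each count (never below zero).
barnardised = {band: max(0, n + random.choice((-1, 0, 1)))
               for band, n in table.items()}

# Winsorisation: pull values above a threshold back to that threshold.
def winsorise(values, upper):
    return [min(v, upper) for v in values]

# Coarse banding: replace a precise age with a wide band.
def band(age, width=10):
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(winsorise(ages, 90))       # the 113 becomes 90
print([band(a) for a in ages])   # precise ages become 10-year bands
```

Note that banding alone may not protect an extreme outlier: a '110-119' band containing one person is exactly the nursing home situation described above. That is why these methods are combined, and why none of them eliminates the risk.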
Studies of re-identification are frequently picked up in the press, and the risk is generally exaggerated. Daniel Barth-Jones's contributions to the Harvard Law School 'Bill Of Health' blog are an excellent overview. He points out that the more stringent methods to protect data come with their own problems ('unnecessary de-identification will mask important heterogeneities between subgroups or destroy the integrity of variance-covariance matrices for multivariate statistical methods') and that re-identifying, though possible, may be too expensive and complex to be practical: 'midnight dumpster diving to look for prescription bottles is likely to become the more economically viable approach to violating our neighbours’ privacy'.
The open data principle
Open data (putting it online for anyone to use) and transparency (making the numbers behind policy clear to the public) are current themes in politics in the UK and elsewhere. I believe that open data is the only principled goal for sharing public sector data. Everything else is a compromise that carries risk to individuals - risk which may yet be outweighed by the potential benefit.
If it's not anonymised safely enough to be open, we should question very closely whether to share it at all. We should be aware of how to assess the risk of identification, and be honest about it - and crucially, where the risk was unanticipated at the time of data collection, we should seek explicit consent.
Ross Anderson, professor of security engineering at Cambridge, suggests that we should be treating these data in the same way that a large company or military organisation would treat their valuable data: share under strict conditions - with the possibility of prosecution for abuse - when there are benefits, and guard it carefully at all other times.
In the main, surveys of public attitudes suggest that, although a minority occupy either extreme, most people want their data to be used for the greater good, but do not want it to be used for commercial gain. However, we tend to be inconsistent in practice, happily handing over intimate details to Facebook but being concerned about spies watching us through our webcams.
The RSS’s new revised Code of Conduct uses a very precise - and in my opinion, excellent - form of words to describe our responsibilities: 'Enquiries involving human subjects should acquire ethical approval as appropriate and, as far as practicable, be based on the freely given informed consent of subjects. The identities of subjects should be avoided in data presentations wherever possible, and be kept confidential unless disclosure is permitted in law or consent for disclosure is explicitly obtained.' For me, the crucial clause is 'the identities of subjects should be avoided'. That is broader than just anonymising or pseudonymising - it implies responsibility beyond the processing of the data, including into how it is shared.
I hope that the law reform project will bear fruit with a practical, simple and future-proof structure, including criminal law applying penalties commensurate with the damage that can be done to individuals in the information age. Once this is in place, the RSS should lead the way in high standards for working with data, including being prepared to expel members found guilty under the law.
Statisticians and all involved in handling personal data that may be shared or queried by external parties should be aware of data security and encryption. There is a body of expertise in information rights and information security of which the great majority of us are not aware, so it is incumbent upon us to bring in such experts to advise us, and not simply to hope nothing goes wrong. Organisations carrying out any sharing or linkage of personal data would then be wise to engage a professionally accredited statistician (PStat in the USA, CStat in the UK) to protect their data and their reputation.
It's time to take this seriously. Not enough people tasked with data processing understand how to assess or mitigate risk in data sharing. There are a lot of misconceptions in the media about the risk of re-identification; the law needs to define ownership of data and to provide serious punishment for criminal abuse or misappropriation; and statisticians need to get informed and offer their professional expertise to help. In fact, as a profession we stand to gain from this. Companies hire a chartered accountant to avoid fines for breaking tax laws, so why not hire a chartered statistician to avoid fines for breaking data laws?
The author is a member of the subgroup of NHS England’s National Advisory Group on Clinical Audit and Confidential Enquiries considering the renewal of projects investigating quality and safety of care. This article reflects his personal and professional views, which are not necessarily those of NAGCAE, NHS England or HM Government.
- 1. Macfarlane A. 'Care.data: Bungled opportunity or unjustified intrusion' Radical Statistics (2014), issue 110, pp 60-64.
- 2. Grant RL et al. 'National sentinel clinical audit of evidence-based prescribing for older people: methodology and development' Journal of Evaluation in Clinical Practice; 8(2): 189-198.
- 3. Pseudonymisation is a process where the organisations supplying the data have a common process for adding a new ID number which replaces any personal identifiers. Because they add the same ID, a recipient can match different data sets together.
- 4. Denning DE, Denning PJ, Schwartz MD. 'The tracker: a threat to statistical database security' ACM Transactions on Database Systems (1979); 4(1): 76-96. DOI: 10.1145/320064.320069.
- 5. For example: Ting D, Fienberg SE, Trottini M. 'Random orthogonal matrix masking methodology for microdata release' International Journal of Information and Computer Security (2008); 2(1).
- 6. As measured by various algorithms, for example mu-ARGUS.
- Thanks to Marion Oswald for her assistance with this article.