Care.data 2.0: why its 'big data' approach should be reconsidered

Written by Emmanuel Lazaridis. Posted in Features

A public outcry followed the announcement last year that personal data from across the NHS were to be comprehensively collected, linked together and distributed. This is an important initiative that is floundering and needs a major rethink. An approach that elicits greater public participation, based on sampling of the general population plus registries for rare conditions and events, would enhance personal privacy and result in better medical science.

There is no denying that 2014 was a tough year for care.data. After NHS England launched it in January of last year, it quickly became clear that this initiative to assemble ‘a set of linked data from all NHS and social care settings to enable better commissioning, research and public health’ had been entered into without adequate input from either domain experts or the public at large.

This failure led to the creation of a care.data Advisory Group, involvement of the Independent Information Governance Oversight Panel (IIGOP) in an oversight role, and greater public outreach by way of a ‘listening exercise’ and a pilot programme in four clinical commissioning groups. Although NHS England continues to play a role in the direction of care.data, responsibility for its sound development and implementation rests with the Health and Social Care Information Centre (HSCIC).

I first raised the question of whether an alternative to the ‘big data’ comprehensive data collection design of care.data could incorporate a statistical sampling approach at an RSS Social Statistics section meeting last year. I subsequently raised this with NHS England and the HSCIC, first in the context of the public consultation on care.data, then by direct correspondence circulated also to the chair and select members of the IIGOP, the care.data Advisory Group and its Programme Board. I think it is fair to say that no commitment to seriously consider any alternative design has been forthcoming.

I wrote this article in the hope that the public bodies responsible for the programme will finally undertake to seriously consider alternatives to comprehensive data collection. In my view an alternative design, ‘care.data 2.0’, would address many of its problems (both perceived and real) while retaining or improving on its oft-touted potential benefits.

An alternative design

The first step to a better design is to recognise that the programme straddles two areas of inferential interest. The first concerns the common condition or event, and the second concerns the rare. Health research has always approached the study of common and rare characteristics differently. Broadly speaking, common characteristics are studied using sampling, and rare ones are studied using registries.
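The difference between the two areas can be made concrete with a little arithmetic. The following is a minimal sketch, not a design proposal: the local population size, the sampling fraction and both prevalence figures are hypothetical values chosen purely for illustration, and the chance of seeing no cases is computed with a Poisson approximation.

```python
import math

# All figures below are hypothetical, chosen only for illustration.
LOCAL_POPULATION = 200_000    # assumed number of NHS users in one area
SAMPLING_FRACTION = 0.05      # assumed share of users drawn into a sample


def expected_cases(prevalence, population=LOCAL_POPULATION,
                   fraction=SAMPLING_FRACTION):
    """Expected number of people with the characteristic in the sample."""
    return prevalence * population * fraction


def prob_no_cases(prevalence, population=LOCAL_POPULATION,
                  fraction=SAMPLING_FRACTION):
    """Approximate chance the sample holds no cases (Poisson approximation)."""
    return math.exp(-expected_cases(prevalence, population, fraction))


common = 0.06        # assumed prevalence: 6 in 100
rare = 1 / 100_000   # assumed prevalence: 1 in 100,000

print(f"common: about {expected_cases(common):.0f} sampled cases expected")
print(f"rare: about {expected_cases(rare):.1f} sampled cases expected, "
      f"chance of none is roughly {prob_no_cases(rare):.2f}")
```

Under these assumed figures the common condition yields several hundred sampled cases, while the rare one yields on average a tenth of a case, so roughly nine samples in ten would contain nobody with it. That is the arithmetic behind studying rare characteristics through registries rather than samples.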

There are at least three good reasons why this distinction should not be lost in the era of ‘big data’ in medicine. First, useful inference on common conditions or events does not require comprehensive data collection at any interesting level of aggregation. Second, a rare characteristic is always more sensitive than a common one, not least because a person with a rare characteristic is more identifiable by way of that characteristic than is a person without. Third, people get excited about rare conditions and events in ways that they don't about common ones. Care.data 2.0 would distinguish between rare and common characteristics. It would approach them differently, using the inherent differences between them as principles of design. In essence, it would be a programme with two independent arms.

One arm of care.data 2.0 would sample from the population of NHS users to draw inferences about common conditions and events. It is for the HSCIC to choose the sampling strategy that might provide the best data, a quite tractable problem. They even have a decent (but not perfect) basis for a sampling frame in the NHS number. Common outcome measures that are used for purposes of commissioning or quality improvement would be (continuously) monitored under the sampling arm of care.data 2.0. This reflects standard practice.

Because a sample will always suffice to draw reliable conclusions on common characteristics at any interesting level of aggregation, collection of more data than might be included in a reasonable sample is unjustifiable under the first and third Caldicott principles [1]. Sampling may not be the only way to achieve compliance with these principles, but it is by far the easiest. Tim Harford spoke eloquently at the RSS Conference in Sheffield last year against the notion that more data necessarily lead to better information.
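How precise a sample-based estimate can be is easy to gauge. The sketch below assumes simple random sampling without replacement and uses the usual normal approximation with a finite population correction; the function name and every number in the example (population size, sampling fraction, prevalence) are illustrative assumptions, not figures from the programme.

```python
import math


def prevalence_margin_of_error(prevalence, population_size,
                               sampling_fraction, z=1.96):
    """Approximate 95% margin of error for an estimated prevalence.

    Assumes a simple random sample drawn without replacement, and applies
    the normal approximation with a finite population correction.
    """
    n = population_size * sampling_fraction
    # Correction for sampling a sizeable share of a finite population.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    standard_error = math.sqrt(prevalence * (1 - prevalence) / n) * fpc
    return z * standard_error


# Hypothetical example: a condition affecting 6% of 200,000 local NHS users,
# estimated from a sample of one user in twenty.
moe = prevalence_margin_of_error(prevalence=0.06,
                                 population_size=200_000,
                                 sampling_fraction=0.05)
print(f"margin of error: about {moe * 100:.2f} percentage points")
```

With these assumed inputs the margin of error comes out at a little under half a percentage point. A real design would of course use a more refined sampling strategy than this, which could only improve on the simple case.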

Sampling increases privacy

In addition to its technical benefits, sampling increases privacy. The HSCIC currently seeks to ensure personal privacy (i.e., to decrease the risk of re-identification of individuals or the misuse of personal data) by imposing stringent electronic security measures. The problem with this approach is that it takes only one mistake (or intentional security breach) to destroy any assurance of privacy, where the data are inherently identifiable.

Sampling automatically increases everyone's privacy because, barring unauthorised access or intentional misuse of direct personal identifiers, it is unknown who is in the sample, and who is out. Even restriction to a 5% sample of NHS users and their records, still a very large set of data, would substantially increase the personal privacy of all. Information about common features can then be more readily shared with the public at the finest levels of granularity, underlining the need to treat common and rare characteristics differently.
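A small simulation illustrates the point. It is only a sketch under stated assumptions: the population is a hypothetical list of 10,000 stand-in identifiers (kept small for speed), the sample is a simple random sample without replacement, and the person being tracked is arbitrary.

```python
import random

# Hypothetical figures for illustration only.
POPULATION_SIZE = 10_000   # assumed number of NHS users (small, for speed)
SAMPLING_FRACTION = 0.05   # a 5% sample


def draw_sample(ids, fraction, rng):
    """Simple random sample without replacement; returns the chosen ids."""
    n = round(len(ids) * fraction)
    return set(rng.sample(ids, n))


rng = random.Random(2015)                 # fixed seed so runs are repeatable
user_ids = list(range(POPULATION_SIZE))   # stand-ins for NHS numbers
target = 42                               # one arbitrary person

# Redraw the sample many times and count how often the person is in it.
draws = 400
times_included = sum(target in draw_sample(user_ids, SAMPLING_FRACTION, rng)
                     for _ in range(draws))
share_included = times_included / draws
print(f"person {target} was sampled in {share_included:.1%} of {draws} draws")
```

Across repeated draws any given person is included in roughly the sampling fraction of them, so knowing only that someone uses the NHS says very little about whether their records are in the sample at all.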

Enabling wider release of health data is a goal of my ongoing work with Robin Mitra (University of Southampton) and Lee Min Cherng (Universiti Tunku Abdul Rahman, Malaysia). We are evaluating the use of targeted partial data synthesis as a means of de-personalising personal data. Initial results of our work that I presented at the RSS Conference suggest that only a very small amount of data synthesis is needed, where common conditions are concerned, to remove data altogether from the ambit of the Data Protection Act 1998, while high-dimensional structure is largely preserved.

A treasury of registries

The second arm of care.data 2.0 would construct a treasury of registries for the study of rare conditions and events. Whereas many countries can do national sampling to learn about common health or care matters, the UK is almost uniquely positioned to be able to radically alter the way that registries of rare conditions or events are proposed, populated and monitored.

What I envision is a participatory system that would engage professionals and the public in the identification of interesting rare characteristics. This would enable thousands of rare conditions and events to be monitored and provide people with an accessible and transparent consent mechanism. Rare conditions, new health technologies and unusual outcome measures that are used for purposes of commissioning or quality improvement would be monitored under this registries arm.

Public engagement in the identification of important rare conditions or events is needed because crowdsourcing the identification process is (arguably) at least as likely to result in the identification of important but rare health or care matters for further study as would machine learning techniques applied to ‘big data’. I can't prove that this is so, but neither have I seen evidence that algorithmic techniques are able to outperform humans in this regard (not at this time at least).

Crucially, the identification of interesting rare conditions or events often depends on making observations that are inadequately coded for in existing data, whose informational content is not necessarily available to data-dependent algorithms. Where the informational content is reflected in the data, I accept that data mining of the national sample would be likely to find interesting but previously unknown conditions of substantial prevalence. However, in the identification of rare ones, the human element – essentially a flexible form of supervised learning – is likely to outperform such algorithms.

A genuine public engagement

Active engagement of the public in the joint enterprise of improving health is a vital element that is missing from the care.data programme. Real public engagement in care.data 2.0 would likely lead to better medicine faster.

Many have questioned whether NHS users should be allowed to voluntarily ‘opt in’ to care.data, or only to ‘opt out’. However, the public discussion of consent has been limited by an implicit assumption that comprehensive data collection is the only possible way forward. Relaxation of this assumption permits hybrid approaches to be considered.

I envision that NHS users would have the option to opt out of identifiable participation in the sampling arm of care.data 2.0. Even if substantial numbers of people were to exercise this option, inference in the sampling arm should not be affected because, crucially, aggregate data would still be available from health care providers under existing minimum consent-to-treat standards.

On the other hand, there are at least two strong reasons why participation in the rare condition or event registries should be opt-in only. First, the greater identifiability of people with rare characteristics implies that greater care should be taken to achieve true informed consent for the reuse of their personal data. Second, opt-in participation is entirely achievable when the number of potential participants is relatively low. While it may be impractical to opt in all NHS users for comprehensive data collection, it isn't at all impractical to opt in the fraction of a percent of NHS users who are likely to be covered by the registry arm of the programme, particularly when registration can be staggered across conditions over a period of years.

Data made more responsive

As noted above, care.data is currently seen as a passive system for vacuuming up data that are being collected in the context of care, for secondary use in the contexts of research and administration. This thinking seriously undersells what care.data could deliver if professionals and the public were genuinely engaged in its development.

‘Found data’ – data re-purposed for something other than for what they were gathered – may address some, but certainly not all, of the questions of interest to the public, the research community, NHS commissioners and administrators. At this time, the value of care.data depends to a great degree on one's personal perception of the extent to which data generated in the context of care will be responsive to the evidence needs of other domains.

This is a matter worthy of further study, not least because millions of pounds are to be spent on care.data's implementation. The programme lacks a solid, evidence-based business plan. The HSCIC suggests that the uses of ‘found’ health data cannot be scoped out in advance of the programme's implementation. Whether or not one accepts this assertion (and I don't), just one addition to the goals of the programme would suffice to make a lower bound on its potential impact estimable. Care.data should actively seek to engage professionals and the general public in improvement of the data that are being collected. By implementing a process whereby data gathered in the context of care could become more responsive to research and administrative needs, it would address the most fundamental gap in our ability to improve public health and the NHS.

Call to action

Separation of care.data into two arms as described above would allow many of the issues that continue to dog the programme to be addressed. Care.data 2.0 could be built within the resource constraints of the current design; it would harness the public's energies, and it would result in better science.

The HSCIC cannot uphold its duty to the public unless it seriously considers the possibility that a sampling-based approach, together with an enhanced treasury of medical registries and active engagement of the public, would provide all the benefits of comprehensive data collection without its serious drawbacks. An evaluation of the evidence for and against comprehensive data collection should be produced in consultation with experts in statistics and the RSS, and published in advance of any decision to roll out care.data nationwide.

Now is the time for statisticians and the general public to call for a fundamental rethink of the programme. Show your support for serious consideration of an alternative design by writing to the HSCIC and tweeting with the hashtag #caredata2.


The views and opinions expressed in this article are the author's own personal views and do not necessarily represent or reflect the views and opinions of his employers, University College London (UCL) or the Royal Statistical Society. For the avoidance of doubt, the author's employers have not supported or reviewed his work in regard to this article, and the author has not employed any UCL data in it.


  • 1. Report on the Review of Patient-Identifiable Information. Department of Health: The Caldicott Committee. December 1997, p. 17.

Copyright 2019 Royal Statistical Society. All Rights Reserved.
12 Errol Street, London, EC1Y 8LX. UK registered charity in England and Wales. No.306096
