The 35th Fisher Memorial Lecture was given this year by Professor Nancy Reid, Professor of Statistical Sciences at the University of Toronto, as part of a day-long conference on data science from the perspective of the mathematical sciences. Nancy’s lecture, titled ‘Statistical science and data science: where do we go from here?’, gave some interesting insights into how data science might best be introduced as an academic programme and how it would fit in with current statistics programmes.
Nancy was introduced to the conference delegates by the renowned geneticist Sir Walter Bodmer, who was himself supervised by Ronald Fisher at Cambridge. He explained that the Fisher Memorial Trust was set up after Fisher died to continue his legacy, and to encourage and promote the discussion of genetics and statistics.
Nancy Reid is a celebrated theoretical statistician; she has won numerous prizes including the Statistical Society of Canada Gold medal and the Royal Statistical Society’s Guy medal in silver for her path-breaking and influential paper 'Parameter Orthogonality and Approximate Conditional Inference', written jointly with Sir David Cox (whom she paid tribute to at the beginning of her lecture, pictured right). She is director of the Canadian Statistical Sciences Institute, an Officer of the Order of Canada and is a past president of the Statistical Society of Canada.
In her lecture, Nancy recounted some of the perceptions she had encountered where ‘statistical science’ was unfavourably compared to ‘big data’. The latter was associated with big machines and high-level computing, whereas the former was associated with small data and was therefore seen as less 'fun'. She acknowledged that there is currently a lot of hype surrounding ‘big data’, citing a ‘hype curve’ which showed that, after the initial hype, big data was now being questioned in articles such as Tim Harford’s ‘Big Data – are we making a big mistake?’ and in Cathy O’Neil’s book Weapons of Math Destruction. There is also a growing suspicion of algorithms.
Nancy went on to describe some of the work she carried out for the Fields Institute programme on 'Statistical Inference, Learning, and Models for Big Data'. Held in the first half of 2015, the programme ran sessions on different themes around big data, including machine learning, optimisation and matrix methods, visualisation, health policy and social policy. Other topics discussed within the sessions included inference, environmental science, networks and genomics.
Nancy outlined a number of observations from the sessions. While it is difficult to predict the long-term impact of the rush to data science, there does seem to be an interesting mix of both old and new statistics involved. She noted that statistical models for big data are complex and high-dimensional; it’s not just that the ‘n’ is large, but that the ‘p’ is also large. Ideas of statistical inference can get 'lost in the machine'.
When it comes to visualisation of data, Nancy observed that visualisers and the number crunchers don’t always talk to each other, so you can have graphics with good data that don’t look good, while conversely there are graphics that look good but don’t have good data.
Regarding data for policy, Nancy noted that people are willing to share data if they think it will help, for example, to find cures for diseases, but have grave concerns about their own privacy. De-identified data is a potential solution, but opinions still differ on whether de-identification protects privacy: in July 2014, the privacy commissioner of Ontario was quoted as saying that it did work, while in the same month a paper by Narayanan and Felten said the opposite.
One of the sessions also focused on defining what data science is – is it a course? A job? A technology? A field of research? In terms of it being a course, Nancy outlined a proposed data science programme at the University of Toronto comprising strands on mathematical reasoning, statistical theory, statistical and machine learning methods, programming and software development, algorithms and data structures, and lastly, communication of results.
Data science research, she said, could cover data collection and data quality, datasets with large ‘n’ and small ‘p’ as well as datasets with small ‘n’ and large ‘p’. It could examine new types of data such as networks, graphs, digital text and images. It could include issues such as how to clean data, data management (ie, converting from raw to that which is ‘analysable’), software programming, collaboration and project management.
Nancy hopes that the area of data science will discover that the 'old core' is important, and that statistical scientists are often trying to solve a range of problems other than simply pattern recognition. Statisticians are often criticised for being too cautious, she said. However, lots of promises have been made around big data. Going back to her ‘hype curve’, Nancy concluded that the next big thing appears to be ‘smart data’.
A Q&A followed, in which the content of statistical courses was discussed: methods, data visualisation and whether courses should include an element of data collection.
At the end of the discussion, Nancy was presented with the silver bowl that is given to all Fisher Memorial Lecturers, and this was followed by a wine reception for all attendees.