As well as being associate professor of applied mathematics at Columbia University, he is the New York Times chief data scientist. At Columbia, he has applied data science to questions thrown up by the explosion of data from advances in genetics. Meanwhile, the newspaper industry has been less happy with its collision with technology, though the effects are no less revolutionary. The data and the questions posed by these changes have created a healthy market for the skills of a data scientist.
More data leading to more questions
In his own line of work, Chris traces the start of this to 1995, when Haemophilus influenzae became the first free-living organism to have its entire genome sequenced. As Chris recalls, ‘Before 1995, modelling in biology was not considered very useful among biologists. But by 1998, biologists were writing essays about how they desperately needed to collaborate with physicists and computer scientists to help them make sense of what, to them, was an astronomical amount of data.’
‘Biology was a field that had its own way of understanding the world for centuries, but then suddenly found itself awash with data because of a technological opportunity. In the late 1990s that caused a lot of existential pain by asking what is biology? What is the scientific method? How do we deal with data when we don’t have an underlying model? And that pain played out again and again in other fields of human endeavour over the next 10 to 15 years.’
He says that these evolutions in science are what gave birth to data science. ‘If you look at how the term “data scientist” has been used over the last five years, most of these people are coming from non-statistical backgrounds. Many of the people for whom the phrase data science resonates are people trying to answer questions from the natural sciences with data.’
This seems to be one of the most visible dividing lines between statisticians and data scientists. Statisticians usually find their way into the field via mathematics, whereas data scientists are making their way onto the same landscape, but through the other branches of academia.
Data science as a term was coined around the turn of this century. But as Chris points out, long before this, in 1962, John Tukey predicted how the approach to data would change in his paper ‘The Future of Data Analysis’. ‘If you read that paper now, it sounds a lot like the mind-set that we now call data science,’ he says. ‘Tukey was already advocating in 1962 for what I would call a data-driven approach to statistics, where you let the data lead, instead of first considering what mathematical analysis is appropriate.’
For the data scientist, this often means getting to work on ‘found’ data that was not initially designed for statistical analysis. But even before that point, the question being asked of the data has to be defined, as Chris describes. ‘This is another example from what I saw in biology. There was a time where biologists would get their hands on some sequence data, and then they would ask “What are we supposed to do with this?”’
‘I can remember being part of some difficult conversations where I would say, I can cluster these genes for you, but why? I can build you a hierarchical dendrogram of genes, but what scientific question would that answer?’
Getting to the heart of the scientific conundrum forms a part of the innovation in data science. ‘You have to figure out what question they are asking. Then you have to reframe that science question as a statistical task, usually a machine learning task, and most of my work for the past 15 years has been just that.’
The expansion of data interest
This broadening of interest in statistics is a direct consequence of the tech revolution that we are currently living through. Chris thinks this new way of being drawn to what can be done with data is now extending into the next generation in subtle but profound ways.
‘One force that is not going away is the internet - students are now growing up not just computationally native but data native. Amazon tells them what books to read, Netflix what movies to watch and Facebook is curating their news feeds. They are growing up comfortable with a world where data shapes their lives and many students find it engaging to learn how that works, or even to do it themselves.’ He continues, ‘This isn’t orthodox computer science or orthodox statistics, but there is a space between the two fields that is going to become more and more a part of the human experience.’
However, the definition of what data science is (or should be) is still a moving target. As he describes, this makes the data scientist with the right skills a rare find. ‘Right now universities are not meeting the demand because defining data science as a job description is the fastest moving timescale, and it will take a lag before academia catches up to that. When I want to hire someone for the New York Times, it’s not easy for me to go to academia and find someone with the particular blend of skills I want.’
He continues, ‘In industry, people have a meaning of data science which revolves largely around a job description, but in academia they are talking about answering questions from a scientific domain using abundant data, and those skills are quite transferable, whether the problem is coming from biology or social science or art history.’
So what can statistics learn from its new counterpart? Chris says, ‘I think what statistics as a community would benefit from is the healthy complementarity in the physics community of theorists and experimentalists. There is a realisation in physics that both sides need each other because sometimes experiment lags theory or theory lags experiment.’
He cites various examples from the development of machine learning where ‘there are times when a theoretical advance motivates an algorithm and others where an experimental advance leads to a theoretical advance.’
Bringing data to the future of the ‘newspaper of record’
Chris’ career gives him perspective on how these advances in scientific data science are being applied outside academia. He describes how the New York Times came to seriously consider how the data they generated could help them adapt to the world of new media. Last year, the newspaper published an ‘Innovation report’ that set out how it would face the future.
He recalls, ‘The first paragraph starts “The New York Times is winning at journalism”, and the first sentence of the second paragraph is “At the same time, we are falling behind in a second critical area: the art and science of getting our journalism to readers.”’
That essentially is the problem for an old media behemoth competing in a digital world. To this end, the New York Times hired Chris to give data a seat at the table when they make decisions on how to market content, through which channels, how to retain and gain subscribers and how the design of the website can be made better for the user, among other questions.
The internet has also drastically changed the relationship between publisher and advertiser. Previously, a magazine or paper could only quote an estimated circulation along with its rate card. Chris brings up the famous John Wanamaker quote: ‘Half the money I spend on advertising is wasted - the trouble is I don’t know which half.’ But this doesn’t apply in the digital age because advertisers can now access the data that gives them the granularity they need to see the true reach of their ad.
This is a major reason why so many of the great newspapers of yesteryear are now in financial trouble. But the New York Times is hoping that, by having people like Chris putting all their user-generated data to use, they will find the edge needed to keep them in the news business.
This brings up another parallel between the skills of a data scientist and those of a statistician working in the commercial world. In this arena, people want answers, not data, as Chris himself describes. ‘I can’t ask a person to understand how to interpret my logistic regression and five coefficients. The job is for the data scientist to serve as a culture broker between the domain and the data.’
Chris was part of a recent panel event at the RSS looking at the relationship between statistics and data science, which is still available to view.