At RSS Conference this afternoon, Matthew Mayhew explained how the ONS is using web scraping technologies to collect price information from the websites of three of the UK’s major retailers, with a view to considering whether this sort of information could, one day, supplement – or even supersede – existing measures. When scraping the websites of Tesco, Sainsbury’s and Waitrose, the ONS looks for prices of items that match those in the current ‘basket’ of goods that is used to calculate the CPI.
Of course, this isn’t straightforward: out of reams of web code, only a few dozen characters will be relevant. And then there’s the problem of product choice: statisticians might want to know the price of an apple – and only an apple – but the scraper might return information on apple juice, or pre-prepared packs of sliced apple. Decimal points can also, sometimes, go astray, resulting in loaves of bread costing £100.
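To make those two problems concrete, here is a minimal Python sketch of the kind of checks involved. It is not the ONS's method (the talk described machine learning and cluster analysis); the function names, the exclusion word list and the factor-of-ten threshold are all assumptions made purely for illustration.

```python
import re
from statistics import median

# Hypothetical price pattern: a pound sign followed by a number.
PRICE_RE = re.compile(r"£\s*(\d+(?:\.\d{1,2})?)")

# Hypothetical words that signal a different product from a plain apple.
EXCLUDE = {"juice", "sliced", "slice"}


def parse_price(text):
    """Pull the first '£x.xx' figure out of a scraped text fragment."""
    match = PRICE_RE.search(text)
    return float(match.group(1)) if match else None


def is_target_product(name, target, exclude):
    """Crude check: the target word appears and no excluded word does."""
    words = re.findall(r"[a-z]+", name.lower())
    # Very rough plural handling: drop a single trailing 's'.
    stems = {w[:-1] if w.endswith("s") else w for w in words}
    return target in stems and not (stems & exclude)


def flag_implausible(prices, factor=10.0):
    """Flag prices more than `factor` times above or below the median."""
    mid = median(prices)
    return [p > mid * factor or p < mid / factor for p in prices]


if __name__ == "__main__":
    for name in ["Braeburn Apples", "Apple Juice 1L", "Sliced Apple Snack Pack"]:
        print(name, is_target_product(name, "apple", EXCLUDE))
    # A loaf scraped at £100 stands out against its neighbours.
    print(flag_implausible([0.95, 1.00, 1.10, 100.0]))
```

Simple rules like these break down quickly across thousands of product names, which is one reason a more flexible, data-driven approach to cleaning is attractive.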
But despite the problems, and the limitations – such as the fact that the three currently tracked retailers account for only 50% of total market share, and do not include discount chains like Aldi and Lidl – the potential of online price checking is too good to ignore. Whether or not it becomes the dominant means of calculating CPI (and, by extension, inflation), web scraping would allow the ONS to track prices on a daily basis. And thanks to some data science know-how and some machine learning expertise, the accuracy and efficiency of the data collection process should continue to improve.
- Matthew Mayhew’s talk, ‘Using machine learning techniques to clean web-scraped price data via cluster analysis’, was part of session 3.4 Contributed – Data Science: Applications to Online Data