Alternative data sources such as web scraped and point of sale scanner price datasets are becoming more commonly available, providing large sources of price data from which measures of consumer inflation could potentially be calculated. Over the past few years, new methods have evolved for compiling price indices from such data sources. This article extends previous ONS research on using web scraped data to compile price indices for clothing items. These are early analyses using experimental techniques to help us develop our statistical methodology and are not comparable with headline estimates of inflation. We would strongly caution against their use in economic modelling and analysis.
Clothing items generally experience much higher rates of product churn (that is, products coming in and out of stock) compared with other expenditure categories. This is due to the fast-paced nature of the fashion industry, with high seasonality in clothing ranges and changing fashion trends. This makes it difficult to follow prices over time. For example, a new range of swimwear could be introduced at the beginning of the summer, be heavily discounted at the end of summer and then replaced entirely by winter wear clothing.
This has contributed to a number of measurement challenges when we include clothing prices in our consumer price inflation measures. For example, the clothing and footwear division was a major contributor to the divergence between the CPI (Consumer Prices Index) and RPI (Retail Prices Index) inflation measurements, as explained in the article CPI and RPI: Increased Impact of the Formula Effect in 2010 (ONS, 2011). Because of these difficulties, there is particular interest in generating clothing price indices from alternative data sources.
This article summarises analysis into using web scraped clothing data to produce experimental price indices from the article Analysis of product turnover in web scraped clothing data, and its impact on methods for compiling price indices (Payne, 2017). It also includes a number of additional indices, including the CLIP (Clustering Large datasets Into Price indices) method, which has recently been developed by Office for National Statistics (ONS). The CLIP method is considered of particular interest as it is based on clustering similar items together and may reduce the problem of product churn associated with calculating price indices using web scraped clothing data.
The data were provided by WGSN (World’s Global Style Network), a global trend authority specialising in fashion. The structure of this article is as follows. Section 2 gives some background on alternative data sources for price collection and Section 3 gives some further information on the data used. Section 4 outlines the different methods used to produce price indices with the web scraped data. In Section 5 we present results from the different methods and compare these indices with a special aggregate of the CPIH index. In Section 6 we summarise the findings and present areas for future research. Charts for each of the web scraped items and aggregate indices are presented in the “Data” section of this release.
Alternative data sources such as scanner data and web scraped data have been enabled by technological developments in recent years. Scanner data are datasets collected by retailers as products are scanned through the till. Average prices can then be derived from this transactional data. However, scanner data may not be available for smaller retailers. We have also to date been unable to obtain scanner data from large retailers. We have therefore focused our research on investigating how web scraped data can be used.
Web scraping collects the price information directly from retailers’ websites. A web scraper is a tool that reads the HTML of a website and extracts the data needed. The methods required to store, process and analyse web scraped data will also inform our use of scanner data in future.
In January 2014, we began a research project to use web scrapers to collect grocery prices from three online retailers as part of the ONS Big Data project. Since the pilot was launched, we have published a number of updates on research into using web scraped data to produce experimental price indices, including methodologies that differ from the more traditional fixed base indices such as the CPIH (a measure of consumer price inflation that includes owner occupiers’ housing costs). The most recent article was Research indices using web scraped data: May 2016 update. A further article, Research indices using web scraped price data: clustering large datasets into price indices (CLIP), was published in November 2016. It looked to overcome the problem of high levels of product churn associated with web scraped data by clustering groups of products together.
This work identified a number of benefits in using web scraped data over the traditional method of data collection but it also highlighted a number of limitations. Some of the main issues with using web scraped data to calculate price indices are summarised in this section for reference.
In the traditional method of price collection, price collectors use their market knowledge to select products which are a representative sample of goods and services, while in theory web scraping collects all prices from the website. This greatly increases the coverage of goods and services available, but it also increases the difficulty of forming a representative basket as there is no expenditure information on what products are actually purchased by consumers, unless weights are available from another source. This means that all products will have the same weight.
Studies by other national statistical agencies have shown that this may lead to downward bias in indices that are calculated from web scraped data, compared with indices calculated from other sources such as scanner data (Chessa and Griffioen, 2016). This is a particular problem with online retailers (such as Amazon) who sell a large number of product lines, as shelf space is not a limiting factor.
The large number of prices collected using web scraping also leads to greater product churn (the data contains a higher number of products that move into and out of the market over time). This is less of an issue for the traditional collection as collectors are able to replace products that go out of stock by using comparable replacements if available. Collectors will also use their market knowledge to choose products that they expect to remain in stock and be representative of what consumers purchase (these products are more likely to remain in stock as retailers would keep stock levels high).
However, high product churn can cause difficulties in matching items over time. The issue is compounded when looking at the clothing industry, as it is also a sector which experiences particularly high levels of product churn due to high seasonality and changing fashion trends. These limitations affect the methodologies that can be used to calculate price indices from the web scraped clothing data.
The data used were provided by WGSN (World’s Global Style Network), a global trend authority specialising in fashion. They collect daily prices and other information from a number of fashion retailers' websites. When this was supplied in 2015, WGSN obtained data for 37 clothing product categories and 38 retailers in the UK, including a mixture of high street and online only retailers. This has subsequently increased. The web scraped data includes price, product and retailer information. Women’s clothing data were provided for the period September 2013 to October 2015 and the men’s clothing data were provided for the period August 2014 to October 2015.
The dataset is therefore very large and the product categories do not necessarily match the item descriptions used for the CPIH collection. Therefore, the analysis was restricted to the following nine clothing items, which map relatively closely to items used in the CPIH classification structure. These nine items are listed in Table 1.
Table 1: Clothing items used for analysis
|ONS item ID
|Men’s Casual Shirt
|Women’s Sportswear Shorts
|Source: Office for National Statistics, World’s Global Style Network
As discussed in Section 2, clothing is expected to have a high level of product churn due to the nature of the fashion industry. This was found to be the case for the nine products analysed from the WGSN data. The proportion of products in the sample over the whole period (that is, being present in the first and last month of the sample) ranged from 5.12% for men’s jeans to 0.07% for women’s coats. This shows that there is a high level of product churn in the clothing sector with the lifespan of most products being only one season. Further analysis into product churn is given in the article Analysis of product turnover in web scraped clothing data, and its impact on methods for compiling price indices (Payne, 2017).
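The churn measure described above can be illustrated with a minimal Python sketch. It assumes a list of (product, month) observations and takes every product ever observed as the denominator; both the input layout and the choice of denominator are assumptions made for illustration, not the definition used in Payne (2017).

```python
def share_present_first_and_last(observations):
    """Share of products observed in both the first and the last month.

    observations: iterable of (product_id, month) pairs, with months as
    sortable strings such as "2013-09" (a hypothetical layout).
    The denominator is every product ever observed (an assumption).
    """
    observations = list(observations)
    months = sorted({month for _, month in observations})
    first, last = months[0], months[-1]
    in_first = {p for p, month in observations if month == first}
    in_last = {p for p, month in observations if month == last}
    all_products = {p for p, _ in observations}
    return len(in_first & in_last) / len(all_products)


# Hypothetical observations: only product "a" spans the whole period.
example = [
    ("a", "2013-09"), ("a", "2015-10"),
    ("b", "2013-09"),
    ("c", "2015-10"),
    ("d", "2014-01"),
]
share = share_present_first_and_last(example)
```

With the hypothetical observations shown, one product out of four is present in both the first and last month.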
These web scraped clothing data were then used to produce research price indices. Various methods can be used to produce price indices (detailed in Annex A). In the case of web scraped data, the fact that there is no expenditure information limits the methods that can be used. The WGSN data were only considered on a monthly basis due to the size of the dataset and difficulties with processing. To calculate the monthly price for each product, a geometric average of prices is used for consistency with the Jevons formula. The article Analysis of product turnover in web scraped clothing data, and its impact on methods for compiling price indices (Payne, 2017) considered three methods of compiling price indices:
- Monthly chained Jevons Index
Further description of these methods is given in Annex A. In this article we also include the following methods to produce price indices from the clothing data:
The CLIP method is considered of particular interest as it is based on clustering similar items together and may reduce the problem of product churn associated with web scraped clothing data. It works by grouping together individual products that have similar characteristics using a clustering algorithm. Instead of more traditional methods which track individual products over time, the CLIP tracks the average price of these clusters instead. This reduces the problem of product churn as it means that when products either go in and out of stock or are rebranded, or new products enter the market, they are assigned to the clusters already found in the dataset.
The clusters are the same in each period so they can be compared: even if the exact products the clusters contain are different, they will have the same characteristics as the original products. The geometric mean of the prices of the products in each cluster is taken as the average price for that cluster. Further information on the CLIP is given in the article Research indices using web scraped price data: clustering large datasets into price indices (CLIP) (Metcalfe et al., 2016), which applies the CLIP to our web scraped grocery data. For the clothing data discussed in this article, there are extra characteristics available that help to form the clusters. One such characteristic is the style of the clothing: for example, whether a pair of jeans is skinny or boot cut is taken into account when the clusters are formed.
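The idea of tracking cluster average prices rather than individual products can be sketched in Python as follows. This is an illustration only: the cluster labels are assumed to have been assigned already (the clustering step itself is not shown), and combining the clusters with equal weights is an assumption of this sketch rather than the published CLIP aggregation. The function names and the record layout are hypothetical.

```python
import math
from collections import defaultdict


def cluster_prices(records):
    """Geometric mean price per cluster and period.

    records: iterable of (cluster, period, price) tuples, where the
    cluster label is assumed to be pre-assigned (hypothetical input).
    Returns {period: {cluster: geometric mean price}}.
    """
    logs = defaultdict(list)
    for cluster, period, price in records:
        logs[(period, cluster)].append(math.log(price))
    prices = defaultdict(dict)
    for (period, cluster), values in logs.items():
        prices[period][cluster] = math.exp(sum(values) / len(values))
    return dict(prices)


def clip_style_index(records, base, current):
    """Index (base = 100) from cluster price relatives.

    Clusters are combined with equal weights, which is an assumption
    of this sketch.
    """
    prices = cluster_prices(records)
    clusters = prices[base].keys() & prices[current].keys()
    if not clusters:
        raise ValueError("no cluster observed in both periods")
    log_relatives = [
        math.log(prices[current][c] / prices[base][c]) for c in clusters
    ]
    return 100.0 * math.exp(sum(log_relatives) / len(log_relatives))


# Hypothetical data: the products differ between periods, the clusters do not.
records = [
    ("skinny", "2014-08", 10.0), ("skinny", "2014-08", 40.0),
    ("skinny", "2014-09", 40.0),
    ("bootcut", "2014-08", 5.0),
    ("bootcut", "2014-09", 5.0), ("bootcut", "2014-09", 5.0),
]
index = clip_style_index(records, "2014-08", "2014-09")
```

Because only cluster averages are compared, a product that leaves the market or a new product that enters changes the cluster mean without breaking the series.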
In Analysis of product turnover in web scraped clothing data, and its impact on methods for compiling price indices (Payne, 2017), nine clothing items were investigated (Table 1) and price indices for these individual items were created. These were the products which best mapped to the Office for National Statistics (ONS) item IDs used in the Consumer Prices Index including owner occupiers’ housing costs (CPIH). In this article, we aggregate these items to produce three aggregate indices (Table 2). Charts for each of the items and aggregate indices are presented in the "Data" section of this release.
Table 2: Clothing items included in each aggregation
|Aggregate |Items included in aggregate
|Men’s clothing |Men’s Jeans, Men’s Shorts, Men’s Casual Shirt, Men’s Socks, Men’s Pants
|Women’s clothing |Women’s Coats, Women’s Sportswear Shorts, Women’s Swimwear, Women’s Tights
|All clothing |Men’s Jeans, Men’s Shorts, Men’s Shirts, Men’s Socks, Men’s Pants, Women’s Coats, Women’s Shorts, Women’s Swimwear, Women’s Tights
|Source: Office for National Statistics, World’s Global Style Network
Published CPIH expenditure weights are used to aggregate up the item level indices to the higher level aggregates. This method is applied consistently to all index methods described. The CPIH weights are applied to the unchained indices for those indices that undergo a chaining process. These indices are then re-chained to get the indices published in the attached “Data” section of this release.
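As an illustration of the aggregation step, the sketch below combines item-level indices into a higher-level aggregate using expenditure weights. It assumes a weighted arithmetic mean, and the weights and index values shown are hypothetical, not published CPIH figures. The unchaining and re-chaining steps are not shown.

```python
def aggregate_index(item_indices, weights):
    """Weighted arithmetic mean of item-level indices.

    item_indices: {item: index value}; weights: {item: expenditure weight}.
    Both dictionaries must cover the same items. The arithmetic form of
    the average is an assumption of this sketch.
    """
    if item_indices.keys() != weights.keys():
        raise ValueError("indices and weights must cover the same items")
    total_weight = sum(weights.values())
    return sum(weights[i] * item_indices[i] for i in item_indices) / total_weight


# Hypothetical item indices and weights for a two-item aggregate.
example_indices = {"jeans": 110.0, "socks": 90.0}
example_weights = {"jeans": 3.0, "socks": 1.0}
aggregate = aggregate_index(example_indices, example_weights)
```

Here the item with three times the weight pulls the aggregate three times as far from 100 as the other item.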
These experimental indices are early analysis to help us develop our statistical methodology for alternative sources of prices data and we would therefore caution against their use in economic modelling and analysis.
In this article, we consider three versions of the GEKS index: GEKS, RYGEKS and IntGEKS.
A GEKS index (originally proposed by Gini, Éltető, Köves and Szulc) is one possible solution to high rates of product churn that can be implemented without introducing significant amounts of chain drift. The GEKS method essentially takes the geometric mean of all bilateral indices connecting all of the periods between the base period and the current period. GEKS indices are free from chaining issues. However, a drawback with using the GEKS index for temporal indices is that whenever a new time point is added the entire index will be revised. This is a significant problem as official price indices are very rarely revised backwards once published.
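One common way of writing the GEKS index described above is given below. The notation is an assumption introduced here for illustration: periods run from 0 to T, P_J^{a,b} is a bilateral Jevons index between periods a and b, S_{a,b} is the set of products priced in both periods, and p_i^a is the price of product i in period a.

```latex
P_{GEKS}^{0,t} = \prod_{l=0}^{T} \left( P_J^{0,l} \, P_J^{l,t} \right)^{\frac{1}{T+1}},
\qquad
P_J^{a,b} = \prod_{i \in S_{a,b}} \left( \frac{p_i^{b}}{p_i^{a}} \right)^{\frac{1}{|S_{a,b}|}}
```

Each intermediate period l provides an indirect route from period 0 to period t, and the index is the geometric mean over all such routes. Adding a new period increases T and so changes every term, which is why the whole series is revised.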
The RYGEKS is a version of the GEKS that was developed to remove this problem as it is based on a rolling year period that allows longer series to be calculated without the need to revise. For a product to be included in the GEKS or the RYGEKS it has to appear in either the base period and the intermediate period or the intermediate period and the current period. For example, if the price change between Monday and Wednesday is calculated then a product must appear in either Monday and Tuesday or in Tuesday and Wednesday.
The Intersection-GEKS (IntGEKS) index, developed by Krsinich and Lamboray (2015), is another version of RYGEKS that uses a matched set of products for bilateral comparison. These products must appear in all three periods: in the example given previously, the products must be in Monday, Tuesday and Wednesday to be included in the calculations. All the GEKS methods in this article use Jevons indices as an input into the GEKS procedure. Further information on these methods is given in Annex A.
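The difference in the product inclusion rules can be sketched in Python. This is a simplified illustration under stated assumptions: the window is taken to be every period supplied, prices are held as {period: {product: price}}, and the function names and the `intersection` flag are hypothetical. It is not the implementation used to produce the published series, and the rolling window of the RYGEKS is not shown.

```python
import math


def jevons(p_a, p_b, products=None):
    """Bilateral Jevons index between two {product: price} dictionaries.

    Uses products present in both periods, optionally restricted to
    the set `products`.
    """
    matched = p_a.keys() & p_b.keys()
    if products is not None:
        matched &= products
    if not matched:
        raise ValueError("no matched products")
    log_sum = sum(math.log(p_b[k] / p_a[k]) for k in matched)
    return math.exp(log_sum / len(matched))


def geks(prices, base, current, intersection=False):
    """GEKS index from `base` to `current` over all periods in `prices`.

    With intersection=True, each pair of bilateral indices uses only the
    products present in the base, link and current periods.
    """
    periods = sorted(prices)
    log_sum = 0.0
    for link in periods:
        common = None
        if intersection:
            common = (prices[base].keys()
                      & prices[link].keys()
                      & prices[current].keys())
        log_sum += math.log(jevons(prices[base], prices[link], common))
        log_sum += math.log(jevons(prices[link], prices[current], common))
    return math.exp(log_sum / len(periods))


# Hypothetical prices: product "c" enters on day 2 and then rises sharply.
prices = {
    "1-Mon": {"a": 10.0, "b": 20.0},
    "2-Tue": {"a": 10.0, "b": 20.0, "c": 5.0},
    "3-Wed": {"a": 20.0, "b": 40.0, "c": 20.0},
}
geks_value = geks(prices, "1-Mon", "3-Wed")
intgeks_value = geks(prices, "1-Mon", "3-Wed", intersection=True)
```

In this hypothetical example the new product "c" affects the GEKS through the Tuesday to Wednesday comparison, but is excluded from the intersection version because it is missing on Monday.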
Figure 1 and Figure 2 present the GEKS and IntGEKS for all clothing and men’s clothing respectively. Figure 3 presents the GEKS, RYGEKS and IntGEKS for women’s clothing. RYGEKS is not calculated for all clothing and men’s clothing because the time period is too short to allow for a useful comparison to be made.
At the all clothing aggregate level, both versions of the GEKS indices decrease over the period. This is also the case for the men’s and women’s clothing aggregates (although please note the different time periods covered).
IntGEKS decreases further and at a faster rate than the other GEKS indices. At the item level, some of the IntGEKS series, such as women’s coats, have decreased by around 75% since September 2013. This level of decrease is clearly implausible even given the nature of the fashion industry.
The RYGEKS and GEKS series have also experienced very large decreases for women’s clothing. This may be due to the longer run of data available for women’s clothing: the men’s clothing indices may decline in a similar way if we had a longer run of data. This might also be because a matched index such as IntGEKS means that the longer staying items in the dataset have more influence on the index, and these longer staying items have the possibility of having larger price decreases in order for stock to be shifted. By comparison, the GEKS and RYGEKS incorporate new products in the index and therefore these products may decrease the influence of the longer staying products on the resulting series.
Chained Jevons, FEWS and CLIP
Three other methods were considered besides the different versions of GEKS to construct price indices from the WGSN clothing data. These were the chained Jevons (referred to as the “daily chained” in previous ONS research articles), FEWS and CLIP methods.
The chained Jevons is a simple method that applies a monthly chained index to the web scraped data.
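A minimal Python sketch of a monthly chained Jevons index is given below. It first takes the geometric mean of each product's prices within a month, then chains together Jevons indices between consecutive months using the products present in both. The record layout and function names are hypothetical, and the sketch is illustrative rather than the production method.

```python
import math
from collections import defaultdict


def monthly_prices(records):
    """Geometric mean price per product and month.

    records: iterable of (product_id, month, price) tuples, with months
    as sortable strings (a hypothetical layout).
    Returns {month: {product_id: price}}.
    """
    logs = defaultdict(list)
    for product, month, price in records:
        logs[(month, product)].append(math.log(price))
    prices = defaultdict(dict)
    for (month, product), values in logs.items():
        prices[month][product] = math.exp(sum(values) / len(values))
    return dict(prices)


def jevons(p_prev, p_cur):
    """Unweighted geometric mean of price relatives for matched products."""
    matched = p_prev.keys() & p_cur.keys()
    if not matched:
        raise ValueError("no matched products between consecutive months")
    log_sum = sum(math.log(p_cur[k] / p_prev[k]) for k in matched)
    return math.exp(log_sum / len(matched))


def chained_jevons(records):
    """Monthly chained Jevons index with the first month set to 100."""
    prices = monthly_prices(records)
    months = sorted(prices)
    index = {months[0]: 100.0}
    for prev, cur in zip(months, months[1:]):
        index[cur] = index[prev] * jevons(prices[prev], prices[cur])
    return index


# Hypothetical prices: "c" enters in February and only counts from then on.
records = [
    ("a", "2014-01", 10.0), ("a", "2014-02", 5.0), ("a", "2014-02", 20.0),
    ("b", "2014-01", 4.0), ("b", "2014-02", 16.0), ("b", "2014-03", 16.0),
    ("c", "2014-02", 7.0), ("c", "2014-03", 7.0),
]
index = chained_jevons(records)
```

Each monthly link uses only the products available in both months, so new products enter the index one month after they first appear.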
The FEWS method (Fixed Effect with a Window Splice) was developed by Statistics New Zealand (Krsinich, 2014) to account for quality change in big data sources. It works by decomposing log price into a time-based effect and a product-based effect, assuming that the change in the product-based effect accounts for quality change.
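The decomposition of log price into a product effect and a time effect can be sketched in Python as below. This is an unweighted illustration that fits the two sets of effects by alternating updates (backfitting) instead of a regression library, and it omits the window splice step of the FEWS entirely. The function name and the input layout are hypothetical.

```python
import math
from collections import defaultdict


def fixed_effects_index(observations, iterations=200):
    """Index (first period = 100) from the model
    log(price) = product effect + time effect.

    observations: iterable of (product_id, period, price) tuples with
    sortable periods. The effects are fitted by alternating averages,
    which is a choice made for this sketch.
    """
    obs = [(i, t, math.log(p)) for i, t, p in observations]
    product_effect = defaultdict(float)
    time_effect = defaultdict(float)
    for _ in range(iterations):
        # Update each product effect given the current time effects.
        sums, counts = defaultdict(float), defaultdict(int)
        for product, period, log_price in obs:
            sums[product] += log_price - time_effect[period]
            counts[product] += 1
        for product in sums:
            product_effect[product] = sums[product] / counts[product]
        # Update each time effect given the new product effects.
        sums, counts = defaultdict(float), defaultdict(int)
        for product, period, log_price in obs:
            sums[period] += log_price - product_effect[product]
            counts[period] += 1
        for period in sums:
            time_effect[period] = sums[period] / counts[period]
    base = min(time_effect)
    return {
        t: 100.0 * math.exp(time_effect[t] - time_effect[base])
        for t in sorted(time_effect)
    }


# Hypothetical prices that fall by the same proportion for every product.
observations = [
    ("a", 1, 10.0), ("a", 2, 9.0),
    ("b", 1, 20.0), ("b", 2, 18.0), ("b", 3, 16.0),
    ("c", 2, 4.5), ("c", 3, 4.0),
]
index = fixed_effects_index(observations)
```

In this example every product follows the same price path, so the time effects recover that path even though no product other than "b" is observed in all three periods.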
CLIP (Clustering Large Datasets into Price Indices) was developed by ONS. It clusters products into similar groups, based on the theory that consumers want to purchase different types of products rather than specific individual products. Further information on the CLIP is given in this article: Research indices using web scraped price data: clustering large datasets into price indices (CLIP). CLIP was designed to help with the problem of high product churn that exists with alternative data sources such as web scraped data and particularly with web scraped clothing data. As such, it may give more credible estimates than the other methods discussed in this article. Further information on all these methods is given in Annex A.
Figure 4, Figure 5 and Figure 6 present the chained Jevons, FEWS and CLIP for all clothing, men’s clothing and women’s clothing respectively.
For all clothing, the FEWS indices descend in a similar way to the IntGEKS indices and are therefore not plausible over a longer time series. The chained Jevons and the CLIP both show more sensible price movements with some seasonality in the indices.
In general, the chained Jevons indices decrease over the period but show a more plausible decline than the FEWS index. At the item level, some clothing items have seen large increases over the period investigated. From September 2013 to October 2015, women’s coats have increased by 25% and women’s shorts have increased by 37% as measured by the chained Jevons. By contrast, the CLIP all clothing index increases over the period from August 2014 to October 2015 by 9%. This increase is driven largely by women’s coats, women’s tights and men’s shirts, which have increased significantly over the shorter period for which all clothing data has been available. However, the CLIP indices are also more volatile.
Comparison with CPIH
There are many reasons why it is not appropriate to draw a direct comparison between the price indices presented in this section and the published CPIH. These include differences in the data sources and the methodology used. Further information on these differences is given in Research indices using web scraped data (Breton et al., 2016).
However, it is still a useful exercise to examine the trends shown in the different indices and therefore in the final part of this section we construct special aggregates of published CPIH item indices, using only the items and weights that have been used in this analysis. This allows us to compare the price indices presented earlier in this section with published CPIH data using similar items. Nevertheless, despite the steps taken, we would expect these indices and the published CPIH to be different, given that many methodological differences remain. The FEWS, IntGEKS, RYGEKS and GEKS series are excluded as these gave more implausible results. However, it should be noted that the CPIH data is not necessarily a benchmark due to the difficulties outlined previously for the traditional collection.
Figure 7, Figure 8 and Figure 9 present the CPIH special aggregate, chained Jevons and CLIP for all clothing, men’s clothing and women’s clothing respectively.
The CPIH all clothing aggregate index remains relatively stable over the period, with a slight upward trend. This is also the case for the men’s and women’s clothing aggregates. The CLIP matches the CPIH trend, but also appears to be more volatile. However, this may be a better reflection of the seasonality displayed in the fashion industry. This is distinct from the chained Jevons, which decreases over the period investigated.
Previous ONS research carried out on alternative data sources for consumer prices has focused on web scraped grocery data. Due to the growth of online retailing, it is important to extend this analysis to other retail sectors. The web scraped clothing data obtained from WGSN gives us the opportunity to apply price index methodology suitable for big data sources to a new section of the consumer prices basket. Web scraping offers the potential to improve greatly the quality and efficiency of consumer price indices. In particular for clothing, web scraped data may also offer the opportunity to overcome problems associated with the traditional collection of clothing prices, such as high seasonality and product churn.
However, there remain a number of limitations to using these data to construct price indices, including problems with processing and cleaning large datasets. Questions remain around whether all web scraped data should be used (which may reduce the representativeness of the products included within the analysis) or whether a sample should be taken. These must be resolved before the data can be put to effective use within consumer prices.
Six different methods for constructing research price indices from the web scraped clothing data were investigated: CLIP, chained Jevons, GEKS, IntGEKS, RYGEKS and FEWS. These were then compared against a special aggregate of the published CPIH index. These are early analyses using experimental techniques to help us develop our statistical methodology and are not comparable with the headline estimate of inflation. We would strongly caution against their use in economic modelling and analysis.
In all cases, the IntGEKS and FEWS methods decline rapidly over the period and therefore do not give plausible price indices from these data. While the GEKS and RYGEKS indices better reflect the seasonal movement in clothing prices, the decline is still implausibly large. This is particularly true for women’s clothing, for which we have a longer run of data. By contrast, the chained Jevons index is actually rather successful in identifying the seasonal peaks and troughs of these particular items and the general level of the indices seems reasonable. The magnitude of the index movements, however, seems greatly exaggerated.
A new approach trialled by ONS called the CLIP was also applied to the web scraped clothing data. This approach clusters products into similar groups, based on the theory that consumers want to purchase different types of products rather than specific individual products. The CLIP aggregate for all clothing increases over the period investigated, matching the CPIH trend, but it also appears to be more volatile. This may be a better reflection of the seasonality displayed in the fashion industry.
This work contributes to a growing body of research into large alternative sources of price data and its results are useful in developing methods for scanner data, as well as web scraped data. Despite the issues faced in producing price indices, web scraped data have the potential to deepen our understanding of price movements in the clothing sector in the medium-term and, in the long-term, improve the way prices are collected for national consumer price indices. This particular piece of research also contributes to work we are doing to improve the collection of clothing price data, outlined in the Consumer prices development plan.
Breton R, et al. (2015): Research indices using web scraped data
Breton R, et al. (2016): Research indices using web scraped data: May 2016 update
Chessa A G and Griffioen R (2016): Comparing Scanner Data and Web Scraped Data for Consumer Price Indices. Report, Statistics Netherlands.
Metcalfe E, et al. (2016): Research indices using web scraped price data: clustering large datasets into price indices (CLIP)
Office for National Statistics (2011): CPI and RPI: Increased Impact of the Formula Effect in 2010
Office for National Statistics (2014): Consumer price indices – technical manual
Payne C (2017): Analysis of product turnover in web scraped clothing data, and its impact on methods for compiling price indices
Contact details for this article
Telephone: +44 (0)1633 455171