1. Abstract

In this article we estimate the value of recreational and aesthetic services provided by green and blue spaces in urban areas in Great Britain that is capitalised into property prices.

To do so, we create a unique house-level dataset by linking data from a property website to a comprehensive dataset of urban green spaces, as well as data on air and noise pollution, and measures of school distance and quality. We extend the traditional hedonic pricing approach by using machine learning techniques to flexibly model house prices.

Unlike standard hedonic pricing via linear regression, our model does not rely on any assumptions regarding the relationship between house prices and the wide range of structural, neighbourhood and environmental characteristics.

We compute partial dependency plots to display the marginal effects of green and blue spaces on house prices and test whether they are linear. We then compute estimates of the value of cultural (recreational and aesthetic) services provided by green and blue spaces in urban areas.

2. Introduction

Almost one-third of the urban area in the UK consists of natural land and green or blue spaces. Urban green spaces [1] are a type of natural asset that provides society with a range of benefits. The Office for National Statistics (ONS), together with the Department for Environment, Food and Rural Affairs (Defra), is developing natural capital accounts for the UK to offer a comprehensive and consistent framework to organise environmental information so that the benefits of nature are better recognised.

In this article we focus on estimating the value of the cultural services of urban green spaces. Cultural services of green spaces primarily consist of recreation and aesthetic views, which are difficult to value because they can be enjoyed for free. Following the previous literature [2], we estimate the value of these cultural services using a hedonic pricing approach. The aim is to isolate the contribution of urban green and blue spaces to property prices, and to account for the effect of other environmental services such as noise and air pollution reduction.

To do so, we create a unique house-level dataset by linking data from a property website to comprehensive information about availability of urban green and blue spaces. We also add data on air and noise pollution and measures of school distance and quality. As a measure of recreational service, we focus on the distance to the nearest green and blue spaces as well as the area of all green and blue spaces within 500 metres of the property. Based on the property description from the website, we assess whether the property has a view over a green or blue space and use this information as a measure of aesthetic services.

Our main contribution is to extend the traditional hedonic pricing approach by using a boosted tree method to flexibly model house prices. Unlike standard hedonic pricing regression, our model does not rely on any assumptions regarding the relationship between house prices and the wide range of structural, neighbourhood and environmental characteristics. As a result, we can test if distance from and area of green and blue spaces jointly affect property prices and if these effects are linear and differ across geographical areas.

Another advantage is that we can control for observed factors in a more flexible way, therefore reducing the bias caused by misspecification of observed variables. It also allows us to capture spatial correlation by using a flexible function of longitude and latitude.

We find that the average contribution of green and blue spaces to property prices is £2,813.8, which is about 1.2% of the average property price in our sample.

The remainder of this article is structured as follows. The next section describes the data and reports some summary statistics. We then set out the methodology used to model property prices and to estimate the effect of green and blue spaces on them. Finally, we present our results and discuss the next steps.

Notes for: Introduction
  1. See the ONS ecosystem accounts for urban areas.

  2. See Irwin (2002); Nicholls and Crompton (2005); Gibbons and others (2014); Schläpfer and others (2015).

3. Data and summary statistics

Data

Zoopla

We use data on property transactions from Zoopla, a UK-based property website. The dataset was originally provided for the previous iteration of this project by Zoopla Limited to the Urban Big Data Centre (UBDC) and includes information for over 1 million properties sold in Great Britain between 2009 and 2016 [1].

Information includes location, number of bedrooms, number of reception rooms, property type, for sale or rent, asking price, sale price and so on. The data provided by Zoopla also contain a description of the property, which we use to fill in missing information about property type and number of bedrooms. We also use the textual description to extract additional characteristics, for example, whether it has a garage, has been recently renovated, or has a fireplace.

Most importantly, we examine the textual description associated with each property to determine whether that property has a view over a green or a blue space, such as a park, a river or the sea. Unlike studies that use a geographic information system (GIS) to detect views over green spaces (for example, Lake and others, 2000; Paterson and Boyle, 2002), we use context-based detection in the description:

  • first we identify descriptions that mention the following key words: overlook, views, backing, surrounded, outlook, look, opposite

  • we then examine whether the 40 preceding or following characters contain any of a set of words indicating green or blue spaces [2]
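The two steps above can be sketched in Python as follows. This is a minimal illustration, not the code used for the analysis: matching trigger words as substrings (so that "overlooking" counts), matching space words as whole words with an optional plural "s", and the function name `has_view` are all assumptions made for this sketch. The word lists are the ones given in the text and in note 2.

```python
import re

# Trigger words listed in the text; matched as substrings (an assumption of this sketch).
TRIGGERS = ["overlook", "views", "backing", "surrounded", "outlook", "look", "opposite"]

# Words indicating green or blue spaces, taken from note 2.
SPACE_WORDS = [
    "green", "woods", "woodland", "lake", "river", "riverfront", "field",
    "recreation", "communal ground", "reservoir", "golf", "quay", "water",
    "marina", "sea", "countryside", "public", "communal garden", "reserve",
    "forest", "course", "playing",
]

WINDOW = 40  # number of characters inspected on each side of a trigger word

TRIGGER_RE = re.compile("|".join(re.escape(t) for t in TRIGGERS))
# Whole-word match with an optional plural "s", so "sea" does not match "seating".
SPACE_RE = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in SPACE_WORDS) + r")s?\b")


def has_view(description: str) -> bool:
    """Return True if a trigger word has a green or blue space word within WINDOW characters."""
    text = description.lower()
    for match in TRIGGER_RE.finditer(text):
        before = text[max(0, match.start() - WINDOW):match.start()]
        after = text[match.end():match.end() + WINDOW]
        if SPACE_RE.search(before) or SPACE_RE.search(after):
            return True
    return False
```

A rule of this kind flags "Balcony overlooking the river" but will also flag phrases such as "opposite public transport", which is one way the false positives mentioned in the text can arise.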

It should be noted that this measure of a view is experimental. Validating the quality of this derived variable is challenging because we have no ground-truth information on whether a property has a view. A small amount of manual checking suggested that the contextual matching is relatively accurate, but false positives and false negatives are still likely to exist.

Ordnance Survey

In the previous iteration of this project the Ordnance Survey (OS) created a wide range of variables that may influence residential property prices for the purposes of the hedonic pricing method (HPM).

These variables were derived through geospatial analysis of multiple OS datasets, both open data and premium data available through the Public Sector Mapping Agreement (PSMA), together with third-party datasets published by government bodies: the Office for National Statistics, Land Registry, Natural England and Natural Resources Wales.

The variables produced by the OS have been re-used for this project. The main environmental variables are the distance to publicly accessible green spaces (PAGS) and blue space, as well as the area of PAGS, blue spaces and natural land cover within 100, 200 and 500 metres of the property [3]. The precise definitions of natural land cover, PAGS and blue space are:

  • natural land cover – any land cover classified as being natural in type, for example, grassland, heath, scrub, orchards, coniferous trees and so on; it does not include inland water bodies and can range from large woodland areas to small grass verges

  • publicly accessible green spaces – Ordnance Survey defines the following as types of green space: public parks or gardens, play spaces, playing fields, sports facilities, golf courses, allotments or community growing spaces, and religious grounds and cemeteries; these spaces contain natural land cover and can also include some blue space, for example, a park that has a lake within it

  • blue space – all inland water bodies, for example, rivers, lakes, ponds, canals and so on

We also use OS OpenData to calculate: approximate minimum distance to a railway line and approximate minimum distance to the coast. The coast is approximated by the high water line at a reduced resolution of 1/80 to save computation time as the original file is extremely large. The motivation for creating this variable is to account for houses that potentially had little access to PAGS but were near the sea. We calculate the minimum distance by matching the coordinates of houses to the nearest coordinates of the coastline data.
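A minimal sketch of this nearest-coordinate matching is given below. It assumes that house and coastline coordinates are in a projected system measured in metres (for example, British National Grid eastings and northings), and it interprets the 1/80 resolution as keeping every 80th coastline vertex. Both points are assumptions of the sketch, as are the function names.

```python
import math


def thin(points, keep_every=80):
    """Keep every `keep_every`-th vertex to reduce the size of the coastline data."""
    return points[::keep_every]


def min_distance_to_coast(house, coast_points):
    """Straight-line distance (metres) from one house to the nearest coastline vertex."""
    house_x, house_y = house
    return min(math.hypot(house_x - x, house_y - y) for x, y in coast_points)


# Hypothetical coordinates, for illustration only.
coast = [(0.0, 0.0), (1000.0, 0.0), (2000.0, 0.0)]
distance = min_distance_to_coast((1000.0, 300.0), coast)
```

Because the distance is measured to the nearest retained vertex rather than to the line itself, the result is an approximation, and thinning the coastline makes it coarser.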

OS data were provided as a combination of a bespoke dataset and other OS open data. The open data were used under the Open Government Licence.

School data

There is a strong correlation between house prices and proximity to schools [4]. We incorporate data on school location and quality from Ofsted and Estyn, the respective English and Welsh school inspection bodies. We were unable to include data on Scottish schools because Education Scotland inspects only a sample of schools, and educational establishments are not given an overall inspection outcome in the way that Ofsted and Estyn provide one. We link the school data to our sample of residential properties and compute:

  • distance to the nearest primary, secondary, and post-16 school

  • inspection rating of the nearest primary, secondary, and post-16 school

An important caveat is that Ofsted is not responsible for the inspection of private schools in England, so these educational establishments are not present in the data. However, private schools’ admission rules are typically not based on a catchment area, so their absence may not affect our analysis much.

School data from Estyn and Ofsted were used under the Open Government Licence.

Noise pollution

Noise pollution has been found to be associated with house prices (for example, Levkovich and others, 2016), and is likely to also be linked to green and blue spaces. Including data on noise pollution in our model helps us identify the cultural services of green spaces, as we hold noise and air pollution constant.

For the purposes of this project we focus only on noise pollution data from major roads and major railways across England, Wales and Scotland, produced by the Department for Environment, Food and Rural Affairs (Defra) [5]. These data are provided as a shapefile of spatial polygons, each with assigned attributes including its boundary and noise class. The noise class is defined over several bins:

  • x ≤ 54.9 dB
  • 55.0 ≤ x ≤ 59.9 dB
  • 60.0 ≤ x ≤ 64.9 dB
  • 65.0 ≤ x ≤ 69.9 dB
  • 70.0 ≤ x ≤ 74.9 dB
  • 75.0 dB ≤ x
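As an illustration of the binning, the sketch below maps a noise level in decibels to one of the six classes listed above. The class labels and the function name are assumptions of this sketch, and a value falling between two published bounds (for example, 59.95 dB) is placed in the lower class.

```python
from bisect import bisect_right

# Lower bounds of the second to sixth classes, in dB.
LOWER_BOUNDS = [55.0, 60.0, 65.0, 70.0, 75.0]
# Hypothetical labels for the six classes listed in the text.
LABELS = ["<=54.9", "55.0-59.9", "60.0-64.9", "65.0-69.9", "70.0-74.9", ">=75.0"]


def noise_class(level_db: float) -> str:
    """Return the label of the noise class that contains `level_db`."""
    return LABELS[bisect_right(LOWER_BOUNDS, level_db)]
```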

There are several different measures of noise pollution that take the time of day into account. We chose the Lden measure, which indicates the 24-hour average noise level with separate weightings for evening and night periods. The noise levels are measured on a 10 metre grid at receptor height of 4 metres above the ground, and the polygons are then formed by merging the neighbouring cells with the noise classes described above.
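For reference, the EU Environmental Noise Directive (2002/49/EC) defines the day-evening-night level as shown below. We assume here, without confirmation from the source, that the Defra data follow the directive's default periods of 12 hours (day), 4 hours (evening) and 8 hours (night), with penalties of 5 dB and 10 dB applied to the evening and night levels respectively.

```latex
L_{den} = 10 \log_{10} \left[ \frac{1}{24} \left(
    12 \cdot 10^{L_{day}/10}
  + 4 \cdot 10^{(L_{evening} + 5)/10}
  + 8 \cdot 10^{(L_{night} + 10)/10}
\right) \right]
```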

The three possible noise metrics, and an explanation of which noise sources were included, are described in the documentation for the 2012 noise mapping dataset.

There is a more recent edition of the strategic mapping that was carried out in 2017. But as this lies outside the period which our Zoopla data covers, we decided to use the strategic mappings from 2012.

Noise pollution data from Defra were used under the Open Government Licence.

Air pollution

We also use air pollution data from Defra, which are produced every year under its Modelling of Ambient Air Quality (MAAQ) contract. For each pollutant, a grid with a resolution of one square kilometre is produced. The most recent modelling methodology was published in 2015 by Ricardo Energy and Environment.

The pollutants used in this project, and the measure used for each, are:

  • NO2: Annual Mean

  • CO: Annual Mean

  • SO2: Annual Mean

  • Ozone: DGT120 (number of days on which the daily maximum 8-hour concentration is greater than 120 µg/m³)

  • Benzene: Annual Mean

While the housing data span between 2009 and 2016, matching houses to an air pollution value from any single year for any pollutant was highly time-consuming. This meant that it was not feasible to match houses to the levels of air pollution that were present in the year that they were sold. Additionally, the levels of CO pollution were only available until 2010. So for consistency, only pollution data from 2010 are used.

Air pollution data from Defra were used under the Open Government Licence.

Defra air pollution datasets are available.

Index of Multiple Deprivation

We use the Index of Multiple Deprivation (IMD) (2015) in order to capture socio-economic characteristics across England and Wales at the Lower layer Super Output Area (LSOA) level. Including the IMD in our model allows us to account for various neighbourhood characteristics such as employment, education, health and crime. The IMD is included as LSOA rankings in our model. This offers more flexibility than using deciles.

Output Area Classification

To further account for the socio-demographic characteristics of the local areas, we include the 2011 Output Area Classification (OAC) in our model. This geo-demographic classification is based on 2011 UK Census data and was derived by Gale and others (2016). The OAC aims to identify areas of the country with similar characteristics. Each output area (a cluster of postcodes, with an average of 125 households) is classified into one of 76 categories, which provide summary indicators of the social, economic, demographic and build characteristics of small areas.

National Grid UK

We also use publicly available data [6] from © National Grid UK to derive several other variables, such as the distance to the nearest substation, distance to the nearest tower and approximate distance to the nearest overhead line. These features could have a negative effect on house prices and are likely to be correlated with access to green spaces.

Sample restriction

We restrict our sample to properties in urban areas that were sold between 2009 and 2016 in England and Wales (because the school data were not available for Scotland). We remove duplicate records and the few records with missing values. We exclude the bottom and top 0.5% of the distribution of property price. Our analytical sample contains 1,101,012 observations.
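The price trimming step can be illustrated as follows. This is a sketch under the assumption that the cut-offs are applied by rank; the source does not state how the 0.5% thresholds were computed or how ties were handled, and the function name is hypothetical.

```python
def trim_tails(prices, share=0.005):
    """Drop the lowest and highest `share` of observations, ranked by price."""
    ordered = sorted(prices)
    n_drop = int(len(ordered) * share)  # observations removed from each tail
    return ordered[n_drop:len(ordered) - n_drop]


# Hypothetical prices 1, 2, ..., 1000: five observations are removed from each tail.
trimmed = trim_tails(list(range(1, 1001)))
```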

Summary statistics

Table 4 in Appendix A shows summary statistics for our sample for the variables included in the model. In our sample the average price – where price is deflated and expressed in 2016 value – is £254,345. Figure 3 in Appendix A shows the distribution of the main variables and we can see that property prices follow a log-normal distribution, with a median of £208,007.

About 6.7% of properties in our sample have a view over a green or a blue space. Residential properties in England and Wales are on average 257.2 metres from a publicly accessible green space (PAGS) and 372.6 metres from a blue space. Within a 500-metre radius, the average area of PAGS is 159,736 square metres, whilst the average area of blue space is 52,955 square metres. The average area of natural land cover is 236,710 square metres. However, as the distributions of the main variables in Figure 3 show, there is substantial variation in access to green and blue spaces.

In our sample, 25.7% of properties do not have any blue space within a 500-metre radius and 6.4% have no access to any PAGS. By contrast, there is some natural land cover within 500 metres of all properties, and we can see from Figure 3 that natural land cover is more evenly distributed than PAGS or blue spaces.

Table 3 also shows summary statistics for all the covariates included in the model (except for Output Area Classification and geographical coordinates).

In Figure 4 in Appendix A we show how property prices vary with the distance to, and area of, PAGS and blue spaces. The association between property price and distance to PAGS and blue spaces follows a U shape: prices are high for properties that are very close to, or relatively far from, PAGS and blue spaces.

Properties far from PAGS and blue spaces may be more centrally located and therefore have access to other amenities, or be more likely to be in expensive cities such as London. The association between property price and the area of PAGS and blue spaces is non-monotonic too, and difficult to interpret. This is because the areas of PAGS and blue spaces are likely to be correlated with many other important factors that determine price. The area of natural land cover is negatively associated with house prices, although price seems to increase with area at the top of the distribution.

Notes for: Data and summary statistics
  1. In the previous iteration of this release, we used not only sold properties but also properties for sale or under offer. Further investigation of the data revealed that in many cases the same property was listed several times under different sale status. Therefore, we only focus on the records that are classified as sold.

  2. List of words: green, woods, woodland, lake, river, riverfront, field, recreation, communal ground, reservoir, golf, quay, water, marina, sea, countryside, public, communal garden, reserve, forest, recreation, course, playing.

  3. The area of PAGS within a given radius of the property includes the total area of the PAGS that have an access point within a given distance from the property.

  4. See for instance this publication from Department for Education

  5. The data are at: England Rail, England Road, Wales Rail and Wales Road, Scotland Rail and Scotland Road

  6. Shapefiles for the transmission network can be found on the National Grid website.

4. Methods

The purpose of the article is to estimate the effect of urban green and blue spaces on property prices. Hedonic regressions are traditionally used to model the price of a property as a function of its attributes (Rosen 1974). Here we adopt a similar approach but use a non-parametric model to relax the assumptions associated with linear regression.

We model the property price as a function of environmental factors, house characteristics, and neighbourhood and geographical characteristics. The dependent variable, $price_{i,t}$, is the real property price of house $i$ sold at time $t$, deflated using the House Price Index (HPI). The vector of environmental characteristics, $env_i$, is the main focus of our analysis and includes the distance to the nearest blue space, the distance to the nearest publicly accessible green space (PAGS), and the areas of all PAGS, blue spaces and natural land cover within 500 metres of the property [1].

Our model also contains information on whether the property has a view over green or blue spaces. Whilst the distance and area of PAGS and blue spaces are measures of the recreational services provided by the natural environment, having a view over a green or a blue space can be seen as an aesthetic service.

$hc_{i,t}$ is a vector of house characteristics. It includes number of bedrooms, property, building and garden area (square feet) and property type. We also derive a set of attributes from the description, such as period of the house (for example, Georgian, Victorian, Edwardian), and features that are likely to influence property prices (for example, garage, presence of original features, whether the property has been renovated recently).

$n_i$ is a vector of neighbourhood and geographical characteristics. It contains distance to amenities other than green and blue spaces, such as transport infrastructure (for example, bus station, railway station), retail area and workplace centroid, as well as distance to features that might negatively impact house prices (such as overhead lines and substations). It also includes distance to the nearest school, and the quality of the nearest school, as well as measures of air and noise pollution (a detailed description can be found in the data section). The socio-economic characteristics of the local area are captured by the Output Area Classification, a 76-category socio-demographic classification based on 2011 UK Census data, and the Index of Multiple Deprivation (see data section for more details). To capture unobserved area characteristics (spatial dependence), we include in our model an unspecified function of longitude and latitude. This approach is very similar to the generalised additive models (Hastie and Tibshirani, 1986) commonly used to capture spatial dependence (Geniaux and Napoléone, 2008). We also include the year when the property went on the market ($year_t$) to capture time effects.
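Putting these pieces together, the specification can be sketched as below. The additive error term $\varepsilon_{i,t}$ (mentioned in the notes to this section) and the use of the price level rather than its logarithm on the left-hand side are assumptions of this sketch; $f$ denotes the unspecified function that the model estimates, and longitude and latitude enter through $n_i$.

```latex
price_{i,t} = f\left(env_i,\; hc_{i,t},\; n_i,\; year_t\right) + \varepsilon_{i,t}
```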

Most hedonic pricing studies use linear regression to model property price and therefore assume that each feature has a linear effect on (log) house prices and that this effect does not depend on the other features [2]. This is a very strong assumption in our case, because property prices are likely to be a complex function of house characteristics, amenities and location: for instance, the effect of environmental amenities may be larger in some parts of the country, or for properties with specific attributes. Interaction terms can be added in linear regression models to introduce more flexibility, but this may lead to a large number of predictors, making the model difficult to estimate in practice.

Relaxing the assumption that the effect of each variable is linear [3] and independent allows us to better capture the effect of environmental characteristics on house prices: for instance, a large area of green spaces may increase house prices more if the house is close to a green space and also has large blue spaces nearby. The effect of environmental characteristics on price may also differ depending on the house characteristics and neighbourhood factors.

This approach allows us to see if environmental amenities have heterogeneous effects on price. Finally, controlling for other factors in a more flexible way reduces the magnitude of unobserved heterogeneity [4] and allows us to capture spatial correlation by using a flexible function of longitude and latitude.

We use a tree-based model to estimate this price function.

Decision trees are a non-parametric machine learning algorithm, which can be applied to both classification and regression tasks. Unlike linear regression models, which are parametric, decision trees make no assumptions about the functional form of the data or the distribution of any model parameters.

While linear regression models the entire dataset with one function, decision trees split the feature space into homogeneous subspaces and then model each subspace with a simple function (usually the average). The main advantage of decision trees is their ability to handle data generated by complex non-linear and non-monotonic multivariate functions. Because a single tree is unlikely to produce a very accurate model, we use extreme gradient boosting to generate an ensemble of trees. The individual trees are built sequentially, such that each new tree attempts to reduce the errors made by the trees before it. For a more detailed explanation about decision trees for regression, see Appendix B.
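The sequential idea can be illustrated with a deliberately simplified sketch: each "tree" is a single split on one variable (a stump), and each new stump is fitted to the residuals left by the ensemble so far. This is plain gradient boosting with squared error, not the regularised objective that XGBoost optimises, and the data, function names and parameter values are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split tree: choose the threshold that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every split leaves both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=50, learning_rate=0.3):
    """Build stumps sequentially, each one fitted to the current residuals."""
    base = sum(y) / len(y)  # start from the average, as a single-leaf tree would
    trees = []
    predictions = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learning_rate * tree(xi) for pi, xi in zip(predictions, x)]
    return lambda xi: base + learning_rate * sum(tree(xi) for tree in trees)


# Hypothetical U-shaped relationship, which a single straight line cannot capture.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [(xi - 4.5) ** 2 for xi in x]
model = boost(x, y)
```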

Estimation procedure

We estimate our model using the R library XGBoost. XGBoost is unable to handle categorical variables and only accepts numerical input. Therefore, categorical variables should either be one-hot encoded or converted to numeric values, as a tree-based model will split a numeric variable flexibly. However, this may be computationally more expensive.

All our categorical covariates excluding the Output Area Classification (OAC) are one-hot encoded. The OAC is converted to a numeric value to reduce the dimensionality of our dataset, from 168 variables down to 93 variables.
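The two encodings can be sketched as follows. The source analysis used R; this Python sketch is for illustration only, and the helper names, the column prefix `is_` and the alphabetical ordering of the numeric codes are assumptions.

```python
def one_hot(values):
    """One indicator column per category: the number of columns grows with the categories."""
    categories = sorted(set(values))
    return [{f"is_{category}": int(value == category) for category in categories}
            for value in values]


def to_numeric(values):
    """A single numeric column: each category is replaced by an integer code."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


# Hypothetical property types.
property_types = ["flat", "terraced", "flat", "detached"]
indicator_rows = one_hot(property_types)
numeric_codes = to_numeric(property_types)
```

Replacing the 76 indicator columns that one-hot encoding the OAC would need with a single numeric column removes 75 columns, which is consistent with the reduction from 168 to 93 variables reported above.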

We split the data into three partitions: train (56%), validate (30%), test (14%). The purposes of these three partitions are as follows:

  • train: the training partition is used to train the model in a supervised fashion; the model can access the true house prices for this partition of the data, which it uses to inform the learning process

  • test: we use the test set to tune the hyperparameters of our model to improve performance and generalisability

  • validate: this partition of the data is never shown to the model until we have decided on our final model; we use this dataset to conduct the analysis
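A split with these proportions can be sketched as below. The source does not say how rows were allocated, so the simple random shuffle, the seed and the function name are assumptions. The partition names follow the article's usage, in which the test partition is used for tuning and the validate partition is held out for the final analysis.

```python
import random


def split_rows(n_rows, shares=(0.56, 0.30, 0.14), seed=0):
    """Randomly allocate row indices to train, validate and test partitions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * shares[0])
    n_validate = round(n_rows * shares[1])
    train = indices[:n_train]
    validate = indices[n_train:n_train + n_validate]
    test = indices[n_train + n_validate:]  # the remainder, roughly 14%
    return train, validate, test


train, validate, test = split_rows(1000)
```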

The total size of the dataset (before partitioning) is approximately 1.10 million rows with 93 variables. Given that a model learns against 56% of these rows (roughly 600,000), training and testing is not a trivial task; due to the very large number of customisable hyperparameters that XGBoost gives the user access to, tuning is a lengthy process.

Grid search is the most thorough method for finding the optimal set of model hyperparameters. However, given the dimensionality of the problem (10 or more potential hyperparameters to tune), the parameter space is far too large to search exhaustively.

In light of this, we adopt a manual approach. We first tune the most important hyperparameters (maximum tree depth, number of trees and learning rate) as well as we can. We then tune the regularisation parameters: L1 and L2 regularisation, γ (the minimum loss reduction required to make a further partition on a leaf node), and the subsample of covariates considered for partitioning per tree, level or node. The hyperparameters chosen for this analysis can be found in Appendix C in Table 6.
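One way to picture this staged approach is as a sequence of small grid searches, each holding the results of the previous stage fixed. The sketch below is an assumption about how such a procedure could be organised, not the authors' code. `toy_score` is a hypothetical stand-in for fitting the model and returning the MAE on the tuning partition, and all candidate values are placeholders; `max_depth`, `eta`, `gamma` and `lambda` are XGBoost parameter names.

```python
from itertools import product


def tune_stage(score, fixed, grid):
    """Grid-search only the parameters in `grid`, holding those in `fixed` constant."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for combination in product(*(grid[name] for name in names)):
        params = {**fixed, **dict(zip(names, combination))}
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params


def toy_score(params):
    """Hypothetical error surface standing in for the MAE on the tuning partition."""
    return (abs(params["max_depth"] - 6) + 10 * abs(params["eta"] - 0.1)
            + abs(params.get("gamma", 0) - 1) + abs(params.get("lambda", 1) - 2))


# Stage 1: tree structure and learning rate. Stage 2: regularisation.
stage_one = tune_stage(toy_score, {}, {"max_depth": [4, 6, 8], "eta": [0.05, 0.1, 0.3]})
stage_two = tune_stage(toy_score, stage_one, {"gamma": [0, 1, 5], "lambda": [1, 2, 10]})
```

With three candidate values per parameter, the two stages need 9 + 9 = 18 model fits, against 81 for an exhaustive grid over the same four parameters. The price is that interactions between parameters tuned in different stages are not explored.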

Once we have optimised our model against the test set, we calculate performance metrics for the three partitions of the data. We use the mean absolute error (MAE), for its ease of interpretation, and the adjusted R squared ($R^2_{adj}$). We prefer to report $R^2_{adj}$ in this instance as we wish to penalise overcomplicated models with an excessive number of variables.
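Both metrics are straightforward to compute from actual and predicted prices. The sketch below uses the standard definitions; the function names and the example values are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def adjusted_r_squared(actual, predicted, n_predictors):
    """R squared penalised for the number of predictors used by the model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - ss_residual / ss_total
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)


# Hypothetical prices (in thousands of pounds) and predictions.
actual = [100, 200, 300, 400, 500]
predicted = [110, 190, 310, 390, 510]
mae = mean_absolute_error(actual, predicted)
r2_adj = adjusted_r_squared(actual, predicted, n_predictors=2)
```

In this example the unadjusted R squared is 0.995 and the adjustment for two predictors lowers it to 0.99. With around 1.1 million rows and 93 variables, the adjustment is much smaller.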

Interpreting the model

Our primary focus is to estimate the effect of environmental amenities on property prices. In a linear model, the coefficients show the conditional association between the independent variables and the dependent variable. In decision trees, assessing the relationship between independent and dependent variables is more complex. However, it is possible to compute marginal effects by estimating the partial dependency (PD) function, which shows how the prediction changes when the variables of interest vary (Zhao and Hastie, 2019).

We use PD plots to visualise the marginal effect of environmental amenities on house prices, conditional on all the other variables included in our model. We show the estimates of the marginal effect of PAGS and blue spaces separately. The partial dependency function for PAGS, $\hat{f}_{pags}(pags)$, is defined as the predicted value of house prices for different values of distance and area of PAGS, holding other factors constant.

Here, $pags$ is a vector containing the distance to the nearest PAGS and the area of PAGS within 500 metres of the property, $x_i$ is a vector containing all the characteristics of property $i$, and $w_i$ is a set of weights aiming to make our sample representative of the stock of residential properties. The weights are derived using data from the Valuation Office Agency breaking down the number of properties by property type, number of bedrooms and region [5]. For a given level of PAGS distance and area, the value of the PD function is calculated as a weighted average of the predicted values for all $n$ properties in the sample, with area and distance fixed at this given level [6]. To avoid overfitting, we estimate $\hat{f}_{pags}(pags)$ on a validation dataset that was not used to train and test the model. Because $pags$ has two dimensions (distance and area), $\hat{f}_{pags}$ is a function of two variables and should therefore be plotted in three dimensions.
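In symbols, the weighted average described above can be written as follows. This is a reconstruction from the verbal definition rather than the authors' exact expression: $\hat{f}$ denotes the fitted model, $\hat{f}(pags, x_i)$ is its prediction for property $i$ with the PAGS distance and area replaced by the values in $pags$, and normalising by the sum of the weights is an assumption.

```latex
\hat{f}_{pags}(pags) = \frac{\sum_{i=1}^{n} w_i \, \hat{f}(pags, x_i)}{\sum_{i=1}^{n} w_i}
```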

The relationship estimated via the PD function can only be interpreted causally if the error term does not contain any factor that influences both property prices and the availability of environmental amenities (ignorability assumption). To make this assumption more likely to hold, we include a wide range of neighbourhood characteristics in our model (distance to amenities, school quality, air and noise pollution, socio-economic classification – see data section for more details).

Also, our models can flexibly capture non-linearities and interactions between the characteristics, reducing the risk that functional misspecification biases the estimates.

Finally, we include an unspecified function of latitude and longitude to capture spatial autocorrelation. However, as with any observational study we cannot test the ignorability assumption and therefore we cannot be fully certain that our estimates reflect a causal relationship.

Valuation of monetary stock

To obtain an estimate of the average effect of green and blue spaces on house prices, we take the difference between the predicted price based on the real data and the predicted price if there were no green and blue spaces [5], and average this difference across properties using the weights $w_i$, which aim to make our sample representative of the stock of residential properties.

The average value of green or blue space is calculated based on all the properties in our sample, including those that have no access to green and blue spaces. We can obtain an estimate of the value capitalised into property prices of the cultural services provided by green and blue spaces by multiplying this average by the number of residential properties in the UK. The recreational services are measured by the distance and area of blue and green spaces, whilst the aesthetic services are captured by the view over green or blue spaces. We obtain 95% confidence intervals via bootstrapping [7].
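The calculation described above can be sketched in symbols as follows. The notation is introduced here for illustration and is not taken from the source: $x_i^{0}$ denotes the characteristics of property $i$ with the green and blue space variables set to their "no access" values, $\bar{\Delta}$ is the weighted average difference, $H$ is the number of residential properties and $V$ is the resulting capitalised value. Normalising by the sum of the weights is an assumption.

```latex
\bar{\Delta} = \frac{\sum_{i=1}^{n} w_i \left[ \hat{f}(x_i) - \hat{f}(x_i^{0}) \right]}{\sum_{i=1}^{n} w_i},
\qquad
V = H \times \bar{\Delta}
```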

Notes for: Methods
  1. A linear regression assumes that

  2. Or another specified functional form.

  3. Omitted interaction terms would end up in the residual εi,t

  4. A comprehensive list of tuneable hyperparameters can be found on the XGBoost Parameters Documentation

  5. The weights are derived for property type × number of bedrooms × region cells. For each cell $j$, the weight is equal to the share of cell $j$ in the VOA data divided by its share in our sample,

    $w_j = (N_j / N) / (n_j / n)$

    where $n_j$ is the number of properties of cell $j$ in our sample, $N_j$ is the number of properties of cell $j$ in the VOA data, and $n$ and $N$ are the corresponding totals. A weight greater than one is applied to properties under-represented in the Zoopla data whilst properties that are over-represented are given a weight lower than one.

  6. This is very similar to the method used to compute marginal effects for generalised linear models.

  7. We set areas of green and blue spaces to 0 and distance to 500 metres, and view of green or blue spaces to zero.

5. Results

Partial dependency plots

The partial dependency function shows the predicted house prices for various distances to, and areas of, publicly accessible green space (PAGS), holding all other characteristics constant. It describes the joint effect of distance and area of PAGS on property price.
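The calculation behind each point of such a plot can be sketched as below: every property is given the same PAGS distance and area, the model predicts a price for each, and the predictions are averaged using the sample weights. The fitted model is replaced here by a hypothetical linear `toy_predict` function, and the column names, weights and data are illustrative assumptions.

```python
def partial_dependence(predict, rows, weights, fixed_values):
    """Weighted average prediction with the chosen variables fixed for every row."""
    weighted_sum = sum(weight * predict({**row, **fixed_values})
                       for row, weight in zip(rows, weights))
    return weighted_sum / sum(weights)


def toy_predict(row):
    """Hypothetical stand-in for the fitted model (prices in pounds)."""
    return (200000 + 20000 * row["bedrooms"]
            - 20 * row["pags_distance"] + row["pags_area"] / 20)


rows = [
    {"bedrooms": 2, "pags_distance": 100, "pags_area": 50000},
    {"bedrooms": 4, "pags_distance": 300, "pags_area": 10000},
]
weights = [1.5, 0.5]

near_large = partial_dependence(toy_predict, rows, weights,
                                {"pags_distance": 0, "pags_area": 100000})
no_access = partial_dependence(toy_predict, rows, weights,
                               {"pags_distance": 500, "pags_area": 0})
premium_percent = 100 * (near_large - no_access) / no_access
```

Repeating the calculation over a grid of distances and areas, and expressing each value relative to the "no access" case, gives curves of the kind reported in this section.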

In Figure 1 we report the effect of distance and area on property price as a percentage difference compared with being further than 500 metres away from any PAGS (and therefore having no area of PAGS within 500 metres of the property) [1]. To better display the effects of distance and area on estimated average property price, we plot the data in two different ways.

Panel A of Figure 1 displays how property price varies with the distance to the nearest PAGS for various areas of PAGS within 500 metres of the property, holding all other characteristics constant. Panel B shows how property price varies by area of PAGS, for several distances to PAGS.

Overall, we can see that being further away from a PAGS reduces property prices, for any area of PAGS. Having large areas of PAGS within 500 metres of the property is associated with an increase in property price.

Being close to large areas of PAGS attracts the largest premium: a property close to a large PAGS is on average about 3.5% (£8,664.0) more expensive than a similar property far from any PAGS. Whilst the effect of area is almost linear, both panels also show that the relationship between property price and distance is non-linear: Panel A shows increasingly flat lines and Panel B shows decreasing amounts of space between lines.

For example, for any fixed area, being 400 metres instead of 500 metres away from a PAGS makes a negligible difference to the estimated average property price, whilst the difference between 100 metres and 200 metres is substantial. For a property with 100,000 square metres of PAGS within 500 metres, the average predicted value falls by 1.0% (£2,421.2) when the distance to the nearest PAGS increases from very close to 100 metres, whereas for an equivalent property it falls by only about 0.1% (£229.6) when the distance increases from 400 metres to 500 metres.

Figure 2 shows the joint effect of distance and area of blue spaces on house prices. These plots are obtained using the same method as that used to obtain the partial dependency plot (PDP) for PAGS but show how price varies when distance and area of blue spaces change, holding everything else constant. Results are expressed as percentage difference compared with being further than 500 metres away from any blue spaces2.

Overall, we can see that the relationship between house price and the area of, and distance to, blue spaces follows a similar non-linear pattern to the one found for PAGS. Properties close to large blue spaces (30,000 square metres) are on average 3.4% (£8,397.7) more expensive than comparable properties with little access to blue spaces. However, the effects of proximity to blue spaces diminish faster than the effect of proximity to PAGS.

We also estimate the marginal effect of having a view over a green or a blue space. We do so by taking the average model prediction where all houses are set to have a view over a green or a blue space and subtracting from it the average model prediction where all houses are set to have no such view. We find that having a view over a green or a blue space increases property price by £5,369.7 (2.0%), holding everything else constant.

The value of urban green and blue spaces

As explained in the Methods section, we estimate the difference between the predicted price based on the real data and the predicted price if there were no publicly accessible green spaces (PAGS) nor blue spaces to obtain an estimate of the average effect of PAGS and blue spaces on property price. To simulate the absence of PAGS and blue spaces, we set areas of PAGS and blue spaces to 0 and distance to 500 metres, and view of green or blue spaces to zero.
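The counterfactual comparison described above can be sketched in a few lines. The sketch below is illustrative only: `toy_price_model`, the column positions and the simulated sample are hypothetical and stand in for the fitted model and the real data. The view-only comparison at the end is an assumption about how the aesthetic component could be separated out, not a description of our exact procedure.

```python
import numpy as np

# Hypothetical column positions in the feature matrix.
DISTANCE, AREA, VIEW, FLOOR_AREA = 0, 1, 2, 3


def toy_price_model(X):
    """Hypothetical stand-in for a fitted price model."""
    distance, area, view, floor_area = X.T
    proximity = np.clip(1.0 - distance / 500.0, 0.0, 1.0) ** 2
    premium = 0.03 * proximity * np.minimum(area / 100_000.0, 1.0) + 0.02 * view
    return 2_000.0 * floor_area * (1.0 + premium)


def counterfactual_value(predict, X, resets):
    """Mean difference between predictions on the real data and on a copy
    in which the columns listed in `resets` are overwritten."""
    X_cf = X.copy()
    for col, value in resets.items():
        X_cf[:, col] = value
    return float((predict(X) - predict(X_cf)).mean())


# Simulated sample (illustrative only).
rng = np.random.default_rng(1)
n = 2_000
X = np.column_stack([
    rng.uniform(0, 500, n),               # distance to nearest space (m)
    rng.uniform(0, 200_000, n),           # area within 500 m (m2)
    rng.integers(0, 2, n).astype(float),  # view indicator
    rng.uniform(40, 200, n),              # floor area (m2)
])

# "No green or blue space": zero area, 500 metres away, no view.
no_green_space = {AREA: 0.0, DISTANCE: 500.0, VIEW: 0.0}
total_value = counterfactual_value(toy_price_model, X, no_green_space)

# Assumption: the aesthetic part is isolated by removing the view only.
view_value = counterfactual_value(toy_price_model, X, {VIEW: 0.0})

share_of_price = total_value / toy_price_model(X).mean()
```

Because the average is taken over all properties, including those that already have no access to green or blue space, `total_value` is a per-property average for the whole sample.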

In Table 1, we display estimates of the value of cultural services capitalised into property prices by year. Based on 2016 data, the average value of PAGS and blue spaces embedded in property prices is £2,813.8 [95% confidence interval (CI): 2,401.5 to 3,089.0], which is 1.2% of the average property price in our sample. This average is calculated on a sample that includes properties that have no access to PAGS nor blue spaces, and therefore can be readily used to obtain an estimate of the overall value of cultural services of urban green and blue spaces that are capitalised into property prices.

To do this, we multiply this figure by the number of residential properties in the UK in 2016 (27.7 million)2. We obtain an estimate of £77.9 billion [95% CI: 66.6 to 85.6 billion] for the stock value of PAGS and blue spaces capitalised into property prices. The estimate is lower than the estimate published in the previous version of this study, in which we used a linear model. This is probably because our tree-based model captures more heterogeneity and reduces the amount of bias. Out of this total, 12.1% (£9.43 billion) can be attributed to aesthetic services (measured by having a view over a green or a blue space), whilst the remainder can be attributed to recreational services.
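The scaling step is simple arithmetic and can be checked with the published, rounded figures. The variable names below are ours. Because the inputs are rounded (in particular the 27.7 million dwellings), the results agree with the published totals only to within rounding; the lower confidence bound, for instance, comes out marginally below the published £66.6 billion.

```python
# Published (rounded) inputs from the text above.
avg_value = 2_813.8                   # average value per property, GBP
ci_low, ci_high = 2_401.5, 3_089.0    # 95% confidence interval, GBP
n_properties = 27.7e6                 # UK residential properties, 2016
aesthetic_share = 0.121               # share attributed to views

total = avg_value * n_properties
total_low = ci_low * n_properties
total_high = ci_high * n_properties

aesthetic = aesthetic_share * total
recreational = total - aesthetic

print(f"Total stock value: £{total / 1e9:.1f} billion "
      f"[{total_low / 1e9:.1f} to {total_high / 1e9:.1f}]")
print(f"Aesthetic: £{aesthetic / 1e9:.2f} billion, "
      f"recreational: £{recreational / 1e9:.2f} billion")
```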

In Table 2 we show the estimates of the average value of cultural services of urban green and blue spaces capitalised into property prices by travel to work area (TTWA). We report both the absolute value and the value relative to the average property price in the area.

These estimates show the average contribution of PAGS and blue spaces to property prices in each TTWA. We observe substantial variation across TTWAs. The average value of cultural services of urban green and blue spaces capitalised into property prices, as a proportion of average property price, is highest in Bath (3.7%) and is above 2% in Manchester, Liverpool, Cardiff, Newcastle, Oxford, York, Cheltenham and Canterbury. The contribution of green and blue spaces to property price is about 1%. The variation across TTWAs in the value of cultural services capitalised into property prices could be due to differences in the availability of green and blue spaces, but also to differences in the returns to living close to green and blue spaces.

Notes for: Results
  1. The reference price is £245,763.7.

  2. Ministry of Housing, Communities and Local Government : Live tables on dwelling stock including vacants

6. Conclusion

In this article we estimate the value of recreational and aesthetic services provided by green and blue spaces in urban areas in Great Britain that is capitalised into property prices. We extend the traditional hedonic pricing approach by using machine learning techniques to flexibly model house prices.

The main benefit of using a non-parametric approach is that we make no assumptions regarding the relationship between house prices and the wide range of structural, neighbourhood and environmental characteristics. The gradient-boosted regression tree model we use allows us to control for observed factors in a fully flexible way, therefore reducing the bias caused by misspecification of observed variables.

We find that the distance to publicly accessible green spaces and blue spaces has a non-linear effect on house prices. We also find that the effect of the distance to PAGS and blue spaces depends on the area of the PAGS and blue spaces. For instance, a property close to a large PAGS is on average about 3.5% (£8,664.0) more expensive than a similar property far from any PAGS. Having a view over a green or a blue space further increases property price by £5,369.7 (2.0%).

We then use the results from this model to estimate the total value of PAGS and blue spaces that is capitalised into property prices. We find that PAGS and blue spaces increase property price by 1.2% (£2,813.8) on average. We estimate that the total value of PAGS and blue spaces that is capitalised into property prices amounts to £77.9 billion.

An important limitation of this study is that we do not distinguish between the different types of PAGS, such as parks, playing fields, or allotments, when modelling property price. Further work should look to model their respective effects on property prices, as it is unlikely that they are all equally valued by home buyers. This would allow for a more in-depth analysis of the value of each individual type of green space and would also improve the estimate of the overall value of PAGS.

7. Authors

Luke Lorenzi and Vahé Nafilyan, Office for National Statistics.

The authors are grateful to Steve Kingston at Ordnance Survey for providing us with the data on green and blue spaces. The authors thank Ordnance Survey for giving us permission to use these data for this publication, and Amy Brownbill, Brett Day, Adam Dutton, Gareth James and Colin Smith for useful comments.

8. References

Gale, C. G., Singleton, A. D., Bates, A. G., and Longley, P. A. (2016). Creating the 2011 area classification for output areas (2011 OAC). Journal of Spatial Information Science, 12(2016), pages 1 to 27

Geniaux, G. and Napoléone, C. (2008). ‘Semi-parametric tools for spatial hedonic models: an introduction to mixed geographically weighted regression and geoadditive models’ in Hedonic Methods in Housing Markets, Springer, pages 101 to 127

Gibbons, S., Mourato, S., and Resende, G. M. (2014). The Amenity Value of English Nature: A Hedonic Price Approach. Environmental and Resource Economics

Hastie, T. J. and Tibshirani, R. J. (1986). Generalized Additive Models. Statistical Science, volume 1, pages 297 to 318

Irwin, E. G (2002) The Effects of Open Space on Residential Property Values. Land Economics Volume 78, pages 465 to 480

Lake, I. R., Lovett, A. A., Bateman, I. J. and Day, B., (2000). Using GIS and large-scale digital data to implement hedonic pricing studies. International journal of geographical information science volume 14(6), pages 521 to 541

Levkovich, O., Rouwendal, J., and van Marwijk, R. (2016). The effects of highway development on housing prices. Transportation, volume 43(2), pages 379 to 405

Nicholls, S. and Crompton, J. L. (2005), The impact of greenways on property values: Evidence from Austin, Texas, Journal of Leisure Research, volume 37, number 3, page 321

Paterson, R. W. and Boyle, K. J. (2002), Out of sight, out of mind? Using GIS to incorporate visibility in hedonic property value models, Land Economics, volume 78, number 3, pages 417 to 425

Rosen, S. (1974). Hedonic prices and implicit markets: Product differentiation in pure competition. Journal of Political Economy, volume 82(1), pages 34 to 55

Schläpfer, F., Waltert, F., Segura, L. and Kienast, F. (2015) Valuation of landscape amenities: A hedonic pricing analysis of housing rents in urban, suburban and periurban Switzerland. Landscape Urban Plan volume 141, pages 24 to 40

Zhao, Q., and Hastie, T. (2019), Causal Interpretations of Black-Box Models. Journal of Business & Economic Statistics, 0(0), pages 1 to 10  

9. Appendix A: Summary statistics

10. Appendix B: Regression trees

Tree-based models for regression

Decision trees are a machine learning algorithm that can be applied to both classification and regression tasks. Unlike generalised linear models (GLMs), they are non-parametric: they make no assumptions about the functional form of the relationship between the variables or about the distribution of any model parameters, which makes tree-based learners inherently different from GLMs.

Whilst the complexity of a GLM is determined only by the number of variables included in the model, the complexity of a decision tree has no upper limit and is likely to increase with more training data. As a result, decision trees can be more computationally expensive to train, but are able to represent more intricate, non-linear relationships between variables, whereas a GLM (with no added interaction terms) assumes that all variables act independently of each other and are linearly related to the outcome on the scale of the link function1.

Whilst GLMs model the entire dataset as one function, decision trees split the learning space into homogeneous subspaces. Therefore, they can handle highly non-linear and non-monotonic multivariate functions.

The algorithm used to train a decision tree depends on whether the task is a regression or classification problem2. Since house price is continuous, this is a regression task.

A decision tree features three key elements: the root node, the internal nodes, and the leaf nodes. The root can be thought of as the base of the tree and contains the entire space of the data. If a decision tree consists of only a root, then regardless of the type of tree, all observations will be predicted to be the same value. This is equivalent to having a GLM with only an intercept term.

Leaving a decision tree in this state will almost never be enough for generating predictive power; complex problems involving large datasets – both in observations and dimensionality – will require the data to be split or partitioned according to some criterion. This criterion is dependent on the nature of the problem – classification or regression.

After the root has been initiated, the tree building algorithm will choose a way to partition the data such that the model minimises some function.

For regression, the data are partitioned in such a way that the algorithm tries to minimise the sum of the squared errors (SSE), given by:

$$\mathrm{SSE} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Here, N is the total number of observations, $y_i$ is the true value of the quantity we are trying to predict, and $\hat{y}_i$ is the predicted value.

This process happens recursively until some stopping condition is met. Popular stopping conditions to prevent overfitting include imposing a maximum tree depth (the maximum number of levels that a tree can have) or a minimum leaf size (if a chosen split would create a leaf that contains fewer than this number of observations, we do not perform the split).

Making predictions using classification or regression trees is relatively straightforward. Each observation in the data will “belong” to a unique leaf, determined by the path along the tree that the observation followed. For classification, the final prediction will simply be the most common class at the leaf node. For regression, the final prediction will be the average of all the true values at the leaf node.
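The steps described in this section (choosing splits that minimise the SSE, stopping on a maximum depth or minimum leaf size, and predicting with the leaf average) can be put together in a short sketch. This is a deliberately simple illustration written for this appendix, not the algorithm used by any particular library; the function names and the toy data are our own.

```python
import numpy as np


def sse(y):
    """Sum of squared errors when every value is predicted by the mean."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0


def best_split(X, y, min_leaf_size):
    """Return (sse, feature, threshold) for the best split, or None."""
    best = None
    for feature in range(X.shape[1]):
        # Skip the largest value so the right-hand side is never empty.
        for threshold in np.unique(X[:, feature])[:-1]:
            left = X[:, feature] <= threshold
            n_left = int(left.sum())
            if n_left < min_leaf_size or len(y) - n_left < min_leaf_size:
                continue  # minimum leaf size stopping condition
            total = sse(y[left]) + sse(y[~left])
            if best is None or total < best[0]:
                best = (total, feature, float(threshold))
    return best


def build_tree(X, y, depth=0, max_depth=3, min_leaf_size=5):
    """Recursively partition the data; leaves store the mean of y."""
    split = best_split(X, y, min_leaf_size) if depth < max_depth else None
    if split is None or split[0] >= sse(y):
        return {"value": float(y.mean())}  # leaf node
    _, feature, threshold = split
    left = X[:, feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[left], y[left], depth + 1, max_depth, min_leaf_size),
        "right": build_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf_size),
    }


def predict_one(tree, x):
    """Follow the path from the root to a leaf and return the leaf average."""
    while "value" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Toy data: a step function in a single feature.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.where(X[:, 0] < 10, 1.0, 5.0)

tree = build_tree(X, y, max_depth=2, min_leaf_size=2)
root_only = build_tree(X, y, max_depth=0)  # root node only
```

With `max_depth=0` the tree consists of the root alone and predicts the overall mean for every observation, which corresponds to the intercept-only case mentioned above.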

Extreme gradient boosting with XGBoost

Before we give a high-level overview of the theory behind XGBoost, it is worth noting that we briefly experimented with other tree-based regression models, including Random Forests. However, after some initial investigation we chose XGBoost as our primary modelling method as it consistently performed better than the other algorithms.

For the purposes of this project we decided to use extreme gradient boosting as implemented by the XGBoost library in R. In the interest of conciseness, we have kept our explanation of XGBoost fairly high-level; the documentation created by the developers outlines the motivation and technical details in much greater depth.

As a general principle, gradient boosting is the process of using an ensemble of weak models to create an overall strong model. The term “ensemble” here simply means a collection of models, or in our case, a collection of decision trees. These individual trees are built sequentially such that the next tree in the sequence attempts to minimise the errors made by the trees before it. After enough boosting rounds we hope that the remaining errors diminish to the point where no further improvement can be made without overfitting to the data.
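The sequential idea can be illustrated with a minimal sketch of plain gradient boosting for squared error loss, using one-split trees ("stumps") as the weak models. This is not XGBoost itself, which adds regularisation and other refinements; the function names, the learning rate and the simulated data are assumptions made for the example.

```python
import numpy as np


def fit_stump(x, residuals):
    """One-split regression tree on a single feature.

    Returns (threshold, left_value, right_value), chosen to minimise the
    sum of squared errors of the residuals.
    """
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_mean = residuals[left].mean()
        right_mean = residuals[~left].mean()
        err = (((residuals[left] - left_mean) ** 2).sum()
               + ((residuals[~left] - right_mean) ** 2).sum())
        if best is None or err < best[0]:
            best = (err, float(threshold), float(left_mean), float(right_mean))
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps sequentially, each one to the current residuals."""
    base = float(y.mean())
    prediction = np.full(len(y), base)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = y - prediction          # errors left by earlier trees
        stump = fit_stump(x, residuals)
        prediction = prediction + learning_rate * stump_predict(stump, x)
        stumps.append(stump)
        history.append(float(((y - prediction) ** 2).mean()))
    return base, stumps, history


def boost_predict(base, stumps, x, learning_rate=0.3):
    """Final prediction: base value plus the sum of all tree contributions."""
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Simulated data: price falls non-linearly with distance (illustrative only).
rng = np.random.default_rng(2)
x = rng.uniform(0, 500, 300)
y = 250_000 + 8_000 * np.exp(-x / 150.0) + rng.normal(0, 500, 300)

base, stumps, history = boost(x, y)
fitted = boost_predict(base, stumps, x)
```

`history` records the training mean squared error after each round; it cannot increase from one round to the next in this setting, although a falling training error says nothing by itself about overfitting, which is why performance on held-out data matters.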

The XGBoost library provides a powerful implementation of gradient boosting for both linear and tree-based models. But as discussed earlier, we prefer not to make any assumptions about the functional form of the relationship between house prices and the variables, hence we choose tree-based models.

The library also gives us access to many hyperparameters that can be tuned to improve model performance including through regularisation techniques. This makes the modelling process more time-consuming, especially since individual models are not quick to train due to the size of the data. However, this is eventually rewarded with improved model performance compared with implementing less complex machine learning algorithms. The final model inference is presented in the Results section.

It is important to highlight that XGBoost models make predictions differently to the “classical” decision trees described in the previous section. As an XGBoost model is an ensemble of decision trees, each tree contributes to the overall final prediction, and for ensembles of trees the way in which each tree contributes depends on the algorithm.

For example, Random Forest and XGBoost models are both ensembles of decision trees but they make their predictions differently. In the case of XGBoost, suppose we have N trees in our ensemble. As with a single decision tree, each observation in the data will belong to a unique leaf in each tree, and each leaf has a weight w. The final prediction will then simply be the sum of these leaf weights across all N trees.
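Written out, the prediction rule described above takes the following form. The notation is introduced here for illustration and is not taken from the main text: $q_k(x_i)$ denotes the leaf of tree $k$ into which observation $i$ falls, and $w^{(k)}_{j}$ is the weight of leaf $j$ in tree $k$. We ignore any constant offset the library may add to every prediction. For comparison, the second line shows the usual Random Forest rule for regression, where $\bar{y}^{(k)}_{j}$ is the average of the training values in leaf $j$ of tree $k$ and the trees are averaged rather than summed.

```latex
\hat{y}_i^{\text{boost}} = \sum_{k=1}^{N} w^{(k)}_{q_k(x_i)}
\qquad\qquad
\hat{y}_i^{\text{forest}} = \frac{1}{N} \sum_{k=1}^{N} \bar{y}^{(k)}_{q_k(x_i)}
```

The difference reflects how the trees are built: boosted trees each correct what the earlier trees left unexplained, so their contributions add up, whereas the trees in a Random Forest are each a separate estimate of the same quantity.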

Notes for: Appendix B: Regression trees
  1. GLM interaction terms can be added manually but this can very quickly increase the dimensionality of the data – something we prefer to avoid.

  2. Here we present a high-level view of this process, but for more information please see classification and regression trees (PDF, 3.05MB).

11. Appendix C: Model hyperparameters

Here we present the values of the model hyperparameters. Any hyperparameters not listed here can be assumed to be their default values in version 0.81 of the XGBoost Python package.

12. Appendix D: Model performance

Contact details for this article

Vahé Nafilyan, Luke Lorenzi
vahe.nafilyan@ons.gov.uk, luke.lorenzi@ons.gov.uk
Telephone: +44 (0)1633 455046