1. Introduction

Administrative data can provide information on housing characteristics. Our previous assessment of the potential uses of Valuation Office Agency (VOA) data identified them as a primary data source to be incorporated into the design and development of Census 2021.

Incorporating VOA data into the census requires linking the two data sources together. However, this will result in households that are missing the number of rooms because of either a failure to link to VOA data or missing information in the VOA data. To ensure that Census 2021 outputs are fit for purpose, an edit and imputation process is required to deal with these issues. In this working paper, we present the research carried out, firstly, to ensure VOA data are suitable to undergo edit and imputation and, secondly, to develop the methodology. This includes a demonstration of edit and imputation for linked 2011 Census and VOA data for the local authorities with the lowest linkage rate. This working paper is an extension of the research presented to the Census Methodological Assurance Review panel. In response to the panel's feedback, additional records are included in the research presented here. A summary report of the main findings has also been published.

Based on this analysis, it is the Office for National Statistics’ (ONS) view that linked VOA number of rooms data are suitable to undergo edit and imputation for Census 2021, despite some challenges (see Section 5).

2. Things you need to know about this release

We are transforming the way we produce population, migration and social statistics to better meet the needs of our users and to produce the best statistics from all the available data. This includes the use of alternative data sources to provide information on number of rooms in Census 2021 (see Section 3). More information about our plans to do this and how we are progressing a programme of work to put administrative data at the core of population, migration and social statistics is available.

We welcome users providing feedback on this research and the methodology used to produce it, including how it might be improved and potential uses of the data. Please email your feedback to admin.based.characteristics@ons.gov.uk and include “Housing” in the subject line of your response.

Editing and imputation is the process of identifying and treating errors in data. For Valuation Office Agency (VOA) number of rooms data, errors arise where the value is missing, either because it was missing in the original VOA data prior to linkage with the census or because the VOA record failed to link to the census. Editing and imputation also treats inconsistencies in data, for example, where VOA number of rooms data conflict with other census information. VOA number of rooms data may be removed and imputed where these conflicts occur. This will ensure VOA data are consistent with the census data, without altering census data. If uncorrected, these missing values and errors can lead to biased or implausible outputs.

The 2011 Census was linked to the VOA data to demonstrate the viability of this approach for the linked VOA number of rooms variable. The 2011 Census used the Canadian Census Edit and Imputation System (CANCEIS), which is an implementation of donor-based imputation. More information about the VOA data can be found in the source overview, and the summary of the quality assurance we have undertaken on it for Census 2021 is available. Our method for linking the two data sources is described in Section 4.

It is important to note that these findings only apply to the VOA number of rooms variable. They should not be taken as a general endorsement that all linked administrative data are suitable for editing and imputation. We recommend that the feasibility research shown here should be repeated on any administrative data variable that is intended to be linked to the census or that is under consideration to replace a census variable in the future.

3. Background

A Census 2021 topic consultation recommended that we continue to collect information on the number of bedrooms from the census, as this is used to derive measures of overcrowding and under-occupancy. For example, the bedroom standard is used by central and local government for housing planning strategy and resource allocation. The number of bedrooms question is considered more straightforward for respondents to answer than the number of rooms question, as evidenced in the 2011 Census Quality Survey (CQS) (PDF, 1.4MB) (67% accuracy of responses for number of rooms and 91% for number of bedrooms).

Number of rooms on the census primarily meets the same information need as number of bedrooms. The Office for National Statistics’ (ONS’) intention to reduce respondent burden by using alternative sources, specifically Valuation Office Agency (VOA) data, to provide information on number of rooms was announced in the Census 2021 White Paper, Help Shape Our Future: The 2021 Census of Population and Housing in England and Wales.

Our previous assessment of the feasibility of using Valuation Office Agency (VOA) data to replace the number of rooms question concluded that the direct agreement rate between the 2011 Census and VOA data for number of rooms was 16%. This was primarily attributable to the definitional differences between the 2011 Census and VOA rooms variables. The census included kitchens, utility rooms and conservatories in its number of rooms estimates, which the VOA data do not. Since most properties have a kitchen or cooking facilities, the number of rooms in the census data was generally higher than the corresponding number of rooms in the VOA data (see Figure 1). If we assume that the number of rooms derived using VOA data records is at least one room less than when derived using the census data, then the agreement rate increases to 48%.
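The two agreement rates described above can be illustrated with a minimal Python sketch. The record pairs are invented toy values, not census or VOA data, and the adjusted rate assumes one plausible reading of the adjustment (a pair agrees when the VOA count is exactly one lower than the census count); the original assessment may have defined it differently.

```python
# Hypothetical toy pairs for linked addresses: (census_rooms, voa_rooms).
pairs = [(5, 4), (4, 4), (6, 5), (3, 2), (5, 3), (4, 3), (7, 7), (2, 1)]


def agreement_rate(pairs, offset=0):
    """Share of pairs where census_rooms minus voa_rooms equals `offset`."""
    matches = sum(1 for census, voa in pairs if census - voa == offset)
    return matches / len(pairs)


# Direct agreement: both sources report the same number of rooms.
direct = agreement_rate(pairs, offset=0)

# Adjusted agreement (assumed reading): VOA is exactly one room lower,
# for example because the kitchen is not counted.
adjusted = agreement_rate(pairs, offset=1)

print(f"direct: {direct:.1%}, adjusted: {adjusted:.1%}")
```

With these toy pairs the adjusted rate is higher than the direct rate, which mirrors the direction of the published figures but not their size.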

Comparatively, the quality of the census responses for number of rooms was measured by the 2011 CQS at 67%. The survey found that differences occurred because respondents had misunderstood the question. Most of these differences (93%) were within plus or minus one room.

This does not mean that the VOA data are of low statistical quality. Using VOA number of rooms for Census 2021 does imply a discontinuity with 2011 Census estimates (because of the definitional difference) that users need to be aware of. It will not be appropriate to measure change in number of rooms from 2011 to 2021; instead, the census bedroom question can be used for comparisons over time. Using the number of rooms in the VOA data for Census 2021 will provide a high-quality relative measure of size enabling the comparison of households across areas within the same time period. Therefore VOA number of rooms can be used for the derivation of the Carstairs Index and Indices of Multiple Deprivation (IMD).

This is the first time we are using administrative data linked to the census to produce a census statistical output. We need to ensure administrative data are suitable to undergo edit and imputation to ensure that the quality of future census outputs is not impacted by households that are missing number of rooms.

4. Linking 2011 Census with Valuation Office Agency data

The current research linked 2011 Census data to 2016 Valuation Office Agency (VOA) data at address level using unique property reference numbers (UPRNs). Properties in the VOA data that were built after 2011 were removed prior to linkage to enable better comparison to the census data. 2011 Census records that could not be assigned a UPRN were also removed. A more detailed description of the linkage methodology can be found in Section 13.

The 2011 Census captured address information at household level. The 2011 Census defines a household as “one person living alone, or a group of people (not necessarily related) living at the same address who share cooking facilities and share a living room or sitting room or dining area”, which may not be within a self-contained dwelling.

Most residential addresses in England and Wales are used by a single household, but we identified that 1% of households had a duplicate address on the 2011 Census, which may be because there was more than one household at an address (for example, households in multiple occupation (HMOs)). In contrast, the VOA data hold information on addresses, and it is not currently possible to identify multiple households at an address in administrative data sources from address information alone. Without additional information about the residents and the relationships between them, it is difficult to tell when there are multiple households living at the same address.

Comparatively, the VOA captures address information of Council Tax units, where each unit is a self-contained dwelling and has its own kitchen. This may not fully align with the way in which addresses were captured in the 2011 Census. Because of the difference in how address information is captured between the two data sources, linkage by UPRN is not perfect (meaning we are uncertain how rooms recorded in VOA data relate to census addresses where there are multiple households). However, in the future, we will be using an address frame that incorporates administrative data sources; therefore, we would expect data linkage to improve in respect to households.

For the purpose of these analyses, 2011 Census records with duplicate UPRNs have been included. As these records are not from an administrative data source, we are able to determine multiple households at these addresses. Note that these records were not linked to the VOA data and were treated the same as census records that did not link to VOA data. If these analyses were to be repeated using only administrative data, care should be taken to ensure there is no bias against addresses with multiple households.

2011 Census records that could not be assigned a UPRN (1.7%) have been removed from these analyses as we could not link them to the VOA data. The 2011 Census did not use UPRNs as address identifiers at the time of capture. The way UPRNs are assigned differs between the 2011 Census and Census 2021, and Census 2021 will use an address frame that includes VOA data; therefore, the number of census records that cannot be assigned a UPRN should reduce in the future.
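The linkage rules described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the production linkage code: the field names (`household_id`, `uprn`, `build_year`, `rooms`), the status labels and the toy records are all hypothetical.

```python
from collections import Counter

# Hypothetical toy inputs; field names are assumptions for illustration.
census = [
    {"household_id": "H1", "uprn": "100", "bedrooms": 3},
    {"household_id": "H2", "uprn": "200", "bedrooms": 1},
    {"household_id": "H3", "uprn": "200", "bedrooms": 1},  # duplicate UPRN
    {"household_id": "H4", "uprn": None, "bedrooms": 2},   # no UPRN assigned
    {"household_id": "H5", "uprn": "300", "bedrooms": 2},  # no VOA match
    {"household_id": "H6", "uprn": "400", "bedrooms": 4},
]
voa = [
    {"uprn": "100", "rooms": 5, "build_year": 1985},
    {"uprn": "200", "rooms": 4, "build_year": 1930},
    {"uprn": "400", "rooms": None, "build_year": 2001},  # missing at source
    {"uprn": "500", "rooms": 3, "build_year": 2014},     # built after 2011
]


def build_full_dataset(census, voa, census_year=2011):
    """Link census households to VOA records by UPRN."""
    # Drop VOA properties built after the census year before linking.
    voa_by_uprn = {r["uprn"]: r for r in voa if r["build_year"] <= census_year}
    # Drop census records that could not be assigned a UPRN.
    with_uprn = [r for r in census if r["uprn"] is not None]
    uprn_counts = Counter(r["uprn"] for r in with_uprn)

    full = []
    for rec in with_uprn:
        out = dict(rec, voa_rooms=None)
        if uprn_counts[rec["uprn"]] > 1:
            out["status"] = "duplicate"  # kept, but deliberately not linked
        elif rec["uprn"] in voa_by_uprn:
            out["status"] = "linked"
            out["voa_rooms"] = voa_by_uprn[rec["uprn"]]["rooms"]
        else:
            out["status"] = "residual"   # failed to link
        full.append(out)
    return full


full = build_full_dataset(census, voa)
for row in full:
    print(row["household_id"], row["status"], row["voa_rooms"])
```

Note that a linked record can still have a missing `voa_rooms` value (H6 here), which is why missingness in the full dataset has two sources.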

The following datasets were created to test whether VOA data are suitable to undergo edit and imputation.

5. 2011 Census edit and imputation strategy

The primary aim of the 2011 Census edit and imputation strategy (PDF, 204KB) was to produce a fully complete microdata set with no erroneous values, missing values or inconsistent responses (also known as a utility dataset). For the data under consideration here, a record would be inconsistent if number of bedrooms was greater than total number of rooms, and missing values occur because of missingness in the source data (here, Valuation Office Agency (VOA) data) or failure to link census records.

The 2011 Census edit and imputation strategy used a donor-based imputation method, where records with errors (“recipients”) are assigned values for the target variable (here, number of rooms) from error-free records (“donors”). Potential donors are selected based on having similar characteristics to the recipients on “auxiliary variables” (here, other census household variables such as accommodation type, tenure or number of bedrooms). This method aims to retain the multivariate structure of the data. This is a standard approach to edit and imputation.
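The idea of donor-based imputation can be shown with a deliberately simple nearest-donor sketch. It is not CANCEIS: the distance (a count of mismatching auxiliary variables), the first-closest tie-breaking and the toy records are all assumptions made for illustration, and production systems typically choose at random among several close donors.

```python
def impute_from_nearest_donor(records, target, aux_vars):
    """Fill missing `target` values from the closest complete record."""
    donors = [r for r in records if r[target] is not None]
    imputed = []
    for rec in records:
        if rec[target] is not None:
            imputed.append(dict(rec))
            continue
        # Distance: number of auxiliary variables on which donor and
        # recipient differ. min() keeps the first donor when there is a tie.
        donor = min(donors, key=lambda d: sum(d[v] != rec[v] for v in aux_vars))
        imputed.append(dict(rec, **{target: donor[target]}))
    return imputed


# Hypothetical toy records; None marks a missing number of rooms.
records = [
    {"accommodation_type": "flat", "tenure": "rented", "bedrooms": 1, "rooms": 2},
    {"accommodation_type": "detached", "tenure": "owned", "bedrooms": 4, "rooms": 7},
    {"accommodation_type": "terraced", "tenure": "owned", "bedrooms": 3, "rooms": 5},
    {"accommodation_type": "flat", "tenure": "rented", "bedrooms": 1, "rooms": None},
    {"accommodation_type": "terraced", "tenure": "rented", "bedrooms": 3, "rooms": None},
]

result = impute_from_nearest_donor(
    records, target="rooms", aux_vars=["accommodation_type", "tenure", "bedrooms"]
)
print([r["rooms"] for r in result])
```

Because the recipient takes an observed value from a real record with similar characteristics, the relationships between number of rooms and the auxiliary variables are carried over rather than modelled explicitly.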

We discuss potential challenges to the assumptions underlying this method posed by linked administrative data. These assumptions include that missingness in the data is predictable from other variables in the dataset, that imputations are generated using a sensible process (for example, a valid model) and that there are sufficient donors available for all subpopulations. Linked administrative data may also pose definitional and time-frame issues. The three broad challenges considered here are:

  • Impact of record linkage failures on editing and imputation in linked administrative and census data (Section 6)
  • Missing data mechanisms for linked 2011 Census and VOA data (Section 7)
  • Relationships with 2011 Census household variables (Section 8)

6. Impact of record linkage failures on editing and imputation in linked administrative and census data

An important assumption of editing and imputation is that the pool of donors is as representative of the recipients and as large as possible. Therefore, it is necessary to check that the unlinked 2011 Census records do not form a distinct subpopulation, which could introduce bias. In this case, the auxiliary variables used to inform imputation of the unlinked 2011 Census records should be similar to the auxiliary variables of the linked records.

The unlinked census records comprised two types: duplicate and residual records. Duplicate records were purposely not linked to Valuation Office Agency (VOA) data as we would be unable to determine the “correct” linked record. Residual records are those that failed to link to a VOA record. As such, none of the unlinked records contain VOA rooms data.

We compared the distributions of the census household characteristics in the linked and unlinked 2011 Census records. These were:

  • number of usual residents
  • tenure
  • self-contained status of accommodation
  • accommodation type
  • landlord
  • number of bedrooms
  • central heating
  • number of cars or vans

Figure 2 demonstrates how distributions of the auxiliary variables differed between the unlinked and linked records for 2011 Census accommodation type (see Section 14 for other census household variables). Generally, the linked records contained higher proportions of larger terraced and semi-detached properties that were more likely to be owned outright or with a mortgage or loan, whereas the unlinked records contained higher proportions of properties such as caravans, rented converted houses and flats with small numbers of bedrooms.

This means that the unlinked 2011 Census records may form a distinct subpopulation from the linked records. However, all auxiliary variable categories were represented in both linked and unlinked records. Despite some differences, the characteristics of unlinked records are sufficiently represented in the linked records to account for them with appropriate selection and weighting of the auxiliary variables. Hence, they provide a suitable donor pool.
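A comparison of this kind can be sketched in a few lines of Python. The sketch below computes category shares for linked and unlinked records and lists any category that appears among unlinked records but has no linked record to act as a donor. The `linked` flag, the variable name and the toy records are hypothetical.

```python
from collections import Counter


def category_shares(records, variable):
    """Proportion of records falling in each category of `variable`."""
    counts = Counter(r[variable] for r in records)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def compare_linked_unlinked(records, variable):
    """Return {category: (linked share, unlinked share)} and uncovered categories."""
    linked = category_shares([r for r in records if r["linked"]], variable)
    unlinked = category_shares([r for r in records if not r["linked"]], variable)
    categories = sorted(set(linked) | set(unlinked))
    table = {c: (linked.get(c, 0.0), unlinked.get(c, 0.0)) for c in categories}
    # Categories present among recipients but absent from the donor pool.
    uncovered = [c for c in categories if c in unlinked and c not in linked]
    return table, uncovered


# Hypothetical toy records.
records = (
    [{"accommodation_type": "semi-detached", "linked": True}] * 3
    + [{"accommodation_type": "flat", "linked": True}]
    + [{"accommodation_type": "caravan", "linked": True}]
    + [{"accommodation_type": "flat", "linked": False}] * 2
    + [{"accommodation_type": "caravan", "linked": False}]
    + [{"accommodation_type": "semi-detached", "linked": False}]
)

table, uncovered = compare_linked_unlinked(records, "accommodation_type")
for category, (linked_share, unlinked_share) in table.items():
    print(f"{category:15s} linked={linked_share:.0%} unlinked={unlinked_share:.0%}")
print("categories with no linked donors:", uncovered)
```

An empty `uncovered` list corresponds to the situation described above, where the shares differ but every category is still represented in the donor pool.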

7. Missing data mechanisms for linked 2011 Census and VOA data

Data are Missing At Random (MAR) if the probability of a missing value can be predicted by values of auxiliary variables. This is considered an “ignorable” missing data mechanism for editing and imputation, as potential bias from imputing MAR data can be mitigated by using an appropriate imputation model. The assumption that data are MAR was tested by examining whether missingness in Valuation Office Agency (VOA) number of rooms data could be predicted from census household variables.

Failure to link is not the only source of missingness in the full dataset, which also contains missingness in the VOA data prior to delivery to the Office for National Statistics (ONS). The percentage of records missing VOA number of rooms in the linked dataset was 1.1% compared with 6.4% in the full dataset. This is because of the addition of the unlinked census records that did not contain any information from the VOA.

Therefore, to ensure the full dataset is MAR, we must also check whether overall missingness in number of rooms is predictable from census household information. To do this, we examined the distribution of census household variables in the full dataset (see note 1).

The levels of missing data in VOA number of rooms differed according to the values of other census household variables in the full dataset, suggesting a MAR mechanism (see Figure 3 for 2011 Census accommodation type and Section 15 for other census household variables). That is, census household variables were able to predict missingness in VOA number of rooms.
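The descriptive check described here amounts to tabulating the missingness rate within each category of an auxiliary variable. The following is a minimal sketch with hypothetical field names and toy records; it is not the analysis code used for Figure 3.

```python
from collections import defaultdict


def missing_rate_by_category(records, variable, target="voa_rooms"):
    """Proportion of records with a missing `target`, per category of `variable`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        totals[rec[variable]] += 1
        if rec[target] is None:
            missing[rec[variable]] += 1
    return {cat: missing[cat] / totals[cat] for cat in totals}


# Hypothetical toy records from a "full dataset" (linked plus unlinked).
records = [
    {"accommodation_type": "flat", "voa_rooms": 2},
    {"accommodation_type": "flat", "voa_rooms": None},
    {"accommodation_type": "flat", "voa_rooms": None},
    {"accommodation_type": "flat", "voa_rooms": 3},
    {"accommodation_type": "detached", "voa_rooms": 6},
    {"accommodation_type": "detached", "voa_rooms": 7},
    {"accommodation_type": "detached", "voa_rooms": 6},
    {"accommodation_type": "detached", "voa_rooms": 5},
]

rates = missing_rate_by_category(records, "accommodation_type")
print(rates)
```

Rates that differ across categories are consistent with a MAR mechanism, although a tabulation like this cannot by itself rule out missingness that also depends on the unobserved number of rooms.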

Notes for: Missing data mechanisms for linked 2011 Census and VOA data

  1. Different missing data mechanisms can operate in administrative compared with survey data. Investigating missing data mechanisms in only the linked dataset would not have been sufficient as missingness resulting from linkage failure would not have been accounted for.

8. Relationships with 2011 Census household variables

Because of the definitional differences between the 2011 Census and Valuation Office Agency (VOA) number of rooms variables, they do not measure the exact same statistical concept. As the VOA number of rooms is intended to replace the census number of rooms, for these analyses we have adopted the VOA definition. It is vital for the quality of the current edit and imputation method that VOA number of rooms data have similar statistical relationships with other household variables as the 2011 Census number of rooms variable.

Edit rule violations (relationship to number of bedrooms)

For most addresses, the number of rooms in VOA data will be at least one less than in the 2011 Census records. This could lead to an increase in the number of addresses where the number of bedrooms is greater than or equal to the number of rooms. Any records where the number of bedrooms is greater than the number of rooms would fail edit rules used in the 2011 Census edit and imputation strategy (PDF, 204KB).

In the 2011 Census, 0.2% of records had a greater number of bedrooms than number of rooms. In VOA data, the number of records that failed the edit rule was higher, at 1.2%. Despite the increase, this should not cause problems for edit and imputation as it is still a small percentage of all records.

We also investigated the number of properties where the number of bedrooms was equal to the number of rooms: 1.6% of records in the 2011 Census had an equal number of rooms and bedrooms, compared with 6.4% in VOA data. This is because the number of rooms in VOA data is typically one less than in 2011 Census data. This will have implications for the continuity of outputs. We welcome user feedback on this.
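The edit rule and the "equal" case can be expressed as a small classification. The sketch below uses invented toy records and hypothetical field names (`bedrooms`, `census_rooms`, `voa_rooms`) to show how the same check is applied to each version of the rooms variable.

```python
def classify_rooms_vs_bedrooms(records, rooms_key):
    """Share of records failing the edit rule, equal, or passing."""
    counts = {"fails_edit": 0, "equal": 0, "passes": 0}
    n = 0
    for rec in records:
        rooms = rec[rooms_key]
        if rooms is None:
            continue  # missing values are handled by imputation, not this rule
        n += 1
        if rec["bedrooms"] > rooms:
            counts["fails_edit"] += 1  # more bedrooms than rooms: inconsistent
        elif rec["bedrooms"] == rooms:
            counts["equal"] += 1       # allowed, but worth monitoring
        else:
            counts["passes"] += 1
    return {key: value / n for key, value in counts.items()}


# Hypothetical toy records.
records = [
    {"bedrooms": 3, "census_rooms": 5, "voa_rooms": 4},
    {"bedrooms": 2, "census_rooms": 3, "voa_rooms": 2},
    {"bedrooms": 1, "census_rooms": 2, "voa_rooms": 1},
    {"bedrooms": 3, "census_rooms": 4, "voa_rooms": 2},
    {"bedrooms": 2, "census_rooms": 5, "voa_rooms": 4},
]

census_shares = classify_rooms_vs_bedrooms(records, "census_rooms")
voa_shares = classify_rooms_vs_bedrooms(records, "voa_rooms")
print("census:", census_shares)
print("VOA:   ", voa_shares)
```

In this toy example, as in the real data, the VOA variable produces more "equal" and "fails_edit" records because it tends to be lower than the census count.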

Association with other census household variables

We examined the relationship between both census and VOA number of rooms data and auxiliary household variables using several simple regression analyses. Census household variables were used as predictors (see note 1) and the two number of rooms variables were the outcome variables. Separate analyses were performed for the census and VOA versions of number of rooms. Nominal household variables were dummy coded. If the obtained coefficients representing the relationship between rooms and other variables are similar for both versions of the rooms variable, it can be inferred that they are measuring similar statistical concepts.

The relationship between household variables and the 2011 Census rooms variable appeared similar to the relationship with the VOA rooms variable across all of the census household variables (see Table 2 for 2011 Census accommodation type and Section 16 for other census household variables). The number of rooms predicted using the VOA rooms variable was typically one less than the number of rooms in the census. This is consistent with the definitional differences between the two sources. This means that an imputation model suitable for the 2011 Census number of rooms variable should also be suitable for the VOA number of rooms variable.
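For a single dummy-coded nominal predictor, the ordinary least squares estimates have a simple closed form: the intercept is the mean outcome in the reference category, and each coefficient is the difference between a category's mean and that reference mean. The sketch below uses this property on invented toy records; the field names and the choice of reference category are assumptions, and it is not the regression code behind Table 2.

```python
from collections import defaultdict
from statistics import mean


def dummy_coded_coefficients(records, predictor, outcome, reference):
    """OLS estimates for one dummy-coded nominal predictor.

    Returns (intercept, {category: coefficient}), where the intercept is the
    mean outcome in the reference category.
    """
    groups = defaultdict(list)
    for rec in records:
        if rec[outcome] is not None:
            groups[rec[predictor]].append(rec[outcome])
    intercept = mean(groups[reference])
    coefficients = {
        cat: mean(values) - intercept
        for cat, values in groups.items()
        if cat != reference
    }
    return intercept, coefficients


# Hypothetical toy records with both versions of the rooms variable.
records = [
    {"accommodation_type": "flat", "census_rooms": 3, "voa_rooms": 2},
    {"accommodation_type": "flat", "census_rooms": 3, "voa_rooms": 2},
    {"accommodation_type": "detached", "census_rooms": 7, "voa_rooms": 6},
    {"accommodation_type": "detached", "census_rooms": 8, "voa_rooms": 7},
]

census_fit = dummy_coded_coefficients(records, "accommodation_type", "census_rooms", "flat")
voa_fit = dummy_coded_coefficients(records, "accommodation_type", "voa_rooms", "flat")
print("census:", census_fit)
print("VOA:   ", voa_fit)
```

In the toy data the coefficients match across the two outcomes while the VOA intercept is one room lower, which is the pattern the comparison in this section looks for.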

Notes for: Relationships with 2011 Census household variables

  1. The census household variables are:

    • number of usual residents
    • accommodation type
    • landlord
    • tenure
    • number of bedrooms
    • central heating
    • number of cars or vans

9. Demonstration of donor-based imputation using linked 2011 Census and VOA data

Data were imputed using a strategy similar to that detailed in the 2011 Census item edit and imputation process report (PDF, 204KB). The 2011 Census imputation tool is the Canadian Census Edit and Imputation System (CANCEIS). We focused on the 10 local authorities (see note 1) with the highest percentage of missing data (between 18% and 45.8%) in the Valuation Office Agency (VOA) number of rooms variable in the full dataset. These combined to a total of 1.7 million addresses, covering both urban and rural geographies.

The current Census 2021 processing plan is that census data will be processed first and administrative data will be linked second. Therefore, 2011 Census household variables were jointly imputed first, followed by a single imputation on VOA number of rooms data using fully imputed census data as weighted auxiliary variables (see note 2). CANCEIS also imputes records that violate edit rules. For any record where number of bedrooms was larger than VOA number of rooms, the number of rooms value was removed and a new one imputed. This protects the observed census data and ensures they cannot be changed based on what has been observed in the administrative data. This keeps to the edit rule principle that where any conflicts between census and administrative data arise, the census variables should be favoured.
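The processing order described above (edit the VOA value, never the census values, then impute from a weighted donor search) can be sketched as follows. This is an illustrative simplification and not CANCEIS itself: the weights are made-up numbers, the weighted-mismatch distance and first-closest tie-breaking are assumptions, and the field names and records are hypothetical.

```python
def edit_then_impute(records, weights):
    """Blank VOA rooms that conflict with census bedrooms, then impute."""
    # Step 1 (edit): where bedrooms > VOA rooms, remove the VOA value only.
    edited = []
    for rec in records:
        rooms = rec["voa_rooms"]
        if rooms is not None and rec["bedrooms"] > rooms:
            rooms = None
        edited.append(dict(rec, voa_rooms=rooms))

    donors = [r for r in edited if r["voa_rooms"] is not None]

    def distance(a, b):
        # Weighted count of auxiliary variables on which two records differ.
        return sum(w for var, w in weights.items() if a[var] != b[var])

    # Step 2 (impute): take the value of the closest donor whose value would
    # pass the edit rule for this recipient. Raises ValueError if none exists.
    result = []
    for rec in edited:
        if rec["voa_rooms"] is None:
            eligible = [d for d in donors if d["voa_rooms"] >= rec["bedrooms"]]
            donor = min(eligible, key=lambda d: distance(rec, d))
            rec = dict(rec, voa_rooms=donor["voa_rooms"])
        result.append(rec)
    return result


# Hypothetical weights; the real weights are not given in this paper.
weights = {"bedrooms": 3, "accommodation_type": 2, "usual_residents": 2, "tenure": 1}

# Hypothetical toy records; census variables are assumed already imputed.
records = [
    {"usual_residents": 2, "accommodation_type": "flat", "tenure": "rented", "bedrooms": 1, "voa_rooms": 2},
    {"usual_residents": 4, "accommodation_type": "semi", "tenure": "owned", "bedrooms": 3, "voa_rooms": 5},
    {"usual_residents": 4, "accommodation_type": "semi", "tenure": "owned", "bedrooms": 3, "voa_rooms": 2},
    {"usual_residents": 1, "accommodation_type": "flat", "tenure": "rented", "bedrooms": 1, "voa_rooms": None},
]

result = edit_then_impute(records, weights)
print([r["voa_rooms"] for r in result])
```

Restricting donors to those whose value satisfies the edit rule means the census bedrooms value is never altered and no imputed record reintroduces a conflict.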

The census number of rooms variable was not imputed; therefore, we could not compare its distribution to VOA number of rooms data. Instead, we compare the post-imputation distribution of VOA number of rooms data to the pre-imputation distribution.

As Figure 4 shows, the VOA post-imputation distribution appears similar to the pre-imputation distribution (see Section 18 for distributions by individual local authority). The slight increase in the proportion of households with fewer rooms is in line with what is expected in a successful imputation, given what we know about the missingness mechanisms. This gives us confidence in linking VOA number of rooms data with Census 2021 data and in using census household variables in the imputation model, as these characteristics are able to predict both missingness and values of the VOA rooms variable despite definitional differences between the two data sources.

There were no edit rule violations in the post-imputation distribution. However, 14.6% of records had equal values for VOA number of rooms and census number of bedrooms. This was 2.3 times the proportion in the pre-imputation distribution.

Following imputation, the relationship between the census household variables and VOA number of rooms data was re-examined. There was little change to the relationships, and the measures appeared similar to the pre-imputation relationships (see Table 3 for 2011 Census accommodation type and Section 17 for other census household variables). This suggests that imputing VOA data after processing census data is viable while aligning with the principle to favour census data over administrative data where there are conflicts between the two.

Notes for: Demonstration of donor-based imputation using linked 2011 Census and VOA data

  1. The local authorities with the most missing data in the full dataset (linked 2011 Census and Valuation Office Agency (VOA) data, plus 2011 Census residual and duplicate records) were:
    • Isles of Scilly (45.8%)
    • Hammersmith and Fulham (37.6%)
    • Kensington and Chelsea (37.5%)
    • Westminster (29.4%)
    • Islington (29.4%)
    • Ceredigion (24.3%)
    • Camden (23.1%)
    • Haringey (21.8%)
    • Gwynedd (18.8%)
    • Southwark (18.0%)
  2. The highest weighted auxiliary variables were: number of usual residents, accommodation type, number of bedrooms and tenure.

10. Summary

The Office for National Statistics (ONS) is committed to replacing the number of rooms question on Census 2021 with Valuation Office Agency (VOA) data. To ensure that Census 2021 outputs are fit for purpose, an edit and imputation process is required to deal with households that are missing the number of rooms (because of either a failure to link to the VOA data or missing information in the VOA data).

We demonstrated this by linking the 2011 Census and VOA records built prior to 2012 at address level, using the unique property reference number (UPRN). Records with duplicate UPRNs, which may be indicative of houses in multiple occupation (HMOs), have been included in these analyses. This differs from our previous publications, which removed records with duplicate UPRNs. The aim of edit and imputation is to create a full dataset with no missing data, and removing duplicated records would mean the data may not capture certain types of households.

We have shown that VOA number of rooms data, when linked to the census, can provide a high-quality relative measure of size especially when compared to the 2011 Census response. The quality of the 2011 Census response for number of rooms was measured by the Census Quality Survey (CQS) at 67%. Although there will be a discontinuity between 2011 and 2021 number of rooms because of the definitional differences, the data quality will be higher and there will be no burden on respondents to answer an additional question.

The VOA number of rooms variable was edited and imputed after processing 2011 Census data, following the current 2021 processing plan, which will use the Canadian Census Edit and Imputation System (CANCEIS). The results suggest that:

  • it is feasible that linked VOA number of rooms can be imputed for 2011 Census household records, despite some challenges (see Section 5)
  • it is possible to predict VOA number of rooms from census variables, both where data were missing prior to linkage and where they are missing because of data linkage failure
  • the 2011 edit and imputation method is robust enough to handle administrative data for number of rooms while favouring survey data over administrative data where the two are inconsistent

Distributions of missing VOA number of rooms values differed by auxiliary variables. This is indicative of a Missing at Random (MAR) mechanism in the data. Therefore, an appropriate imputation strategy can be designed to limit bias in the data. Our analysis focused on the 10 local authorities with the highest percentage of missing data (between 18% and 45.8%). This demonstrated the robustness of the proposed edit and imputation strategy.

There was an increase in inconsistent VOA data compared with 2011 Census data (for example, where number of bedrooms was greater than number of rooms), although this was within acceptable limits and could be corrected by editing and imputation. The relationship that the VOA and 2011 Census number of rooms variables have with auxiliary variables was similar, indicating that both VOA and 2011 Census number of rooms variables measure similar statistical concepts.

Crucially, these results apply to only the VOA number of rooms variable. They should not be seen as a general endorsement that all linked survey-administrative data are suitable to undergo edit and imputation procedures. We recommend that the kind of feasibility research shown here should be conducted on any administrative data variable that is intended to be linked to the census or that is under consideration to replace a census variable in the future.

Based on this analysis, it is the ONS’ view that linked VOA number of rooms data are suitable to undergo edit and imputation for Census 2021, despite some challenges.

11. Further research

The aim of the edit and imputation strategy for the census is to provide a full and complete dataset. This research demonstrates a worst-case scenario. In the future, data linkage between Valuation Office Agency (VOA) and census data should improve. There are methodological differences between the 2011 Census and Census 2021 in the way unique property reference numbers (UPRNs) are assigned. Our current research removed 1.7% of 2011 Census records that could not be assigned a UPRN. Census 2021 will use an address frame that has an implicit link to VOA data for most records; therefore, the number of census records that cannot be assigned a UPRN and be linked should reduce.

There were time frame differences between the 2011 Census and the VOA data used in the current research. The VOA data were a snapshot captured in 2016 and filtered to only include addresses built before 2012 to enable a better comparison to census data. Some error in the linked data may be attributed to this difference. In the future, we will have VOA data that align with the census data collection period. This will likely reduce any discrepancies resulting from the data being captured in different years, hence reducing overall rates of error.

We aim to repeat these analyses using 2019 Census Rehearsal data linked to equivalent VOA data to ensure we are operationally ready for Census 2021.

12. Feedback

We are keen to get feedback on this research and the methodology used, including how the research might be improved and potential uses of the data. Please email your feedback to admin.based.characteristics@ons.gov.uk and include “Housing” in the subject line of your response.

14. Annex B: Distributions of linked 2011 Census–VOA and unlinked 2011 Census records, by 2011 Census household variables

This annex presents distributions of linked 2011 Census–Valuation Office Agency (VOA) and unlinked 2011 Census records by all census household variables that could not be reproduced in Section 6. The distribution for 2011 Census accommodation type can be found in Figure 2 in Section 6.

15. Annex C: Proportions of missingness in VOA number of rooms in the full dataset, by 2011 Census household variables

This annex presents proportions of missing Valuation Office Agency (VOA) number of rooms in the full dataset by all census household variables that could not be reproduced in Section 7. The distribution for 2011 Census accommodation type can be found in Figure 3 in Section 7.

16. Annex D: Pre-imputation regression results predicting 2011 Census and VOA number of rooms from 2011 Census household variables

This annex shows results for pre-imputation regressions predicting 2011 Census and Valuation Office Agency (VOA) number of rooms from 2011 Census household variables that could not be reproduced in Section 8. The results for 2011 Census accommodation type can be found in Table 2 in Section 8.

17. Annex E: Pre- and post-imputation regression results predicting VOA number of rooms from 2011 Census household variables

This annex shows results for pre- and post-imputation regressions predicting Valuation Office Agency (VOA) number of rooms from 2011 Census household variables that could not be reproduced in Section 9. The results for 2011 Census accommodation type can be found in Table 3 in Section 9.

18. Annex F: Pre- and post-imputation distributions of VOA number of rooms, at local authority level

This annex presents pre- and post-imputation distributions of Valuation Office Agency (VOA) number of rooms for the 10 local authorities with the most missing data in the full dataset (linked 2011 Census–VOA, plus 2011 Census residual and duplicate records). The overall pre- and post-imputation distributions of VOA number of rooms can be found in Figure 4 in Section 9.

19. Contact details for this methodology

This research was produced in collaboration between Method, Data and Research Division (Andy Mealor, Anna Summerbell, Fern Leather) and Statistical Design and Research Division (Sarah Collyer, Ali Dent, Stephan Tietz).

Email: admin.based.characteristics@ons.gov.uk
