Research Outputs: An update on developing household statistics for an Administrative Data Census

1. Disclaimer and feedback

The Research Outputs are NOT official statistics on the population. Rather they are published as outputs from research into an Administrative Data Census approach. These outputs must not be reproduced without this disclaimer and warning note, and should not be used for policy- or decision-making.

If you have any questions or feedback please email Admin.Data.Census.Project@ons.gov.uk (and include the subject line “Research Outputs feedback”).

2. Main points

Following the publication of our Research Outputs on occupied address (household) estimates from administrative data: 2011 and 2015, we’ve investigated using a coverage survey to improve estimates for the number of occupied addresses in England and Wales by local authorities.
Using a dual system estimation (DSE) method, our 2011 coverage-adjusted estimate of 23.4 million occupied addresses is 0.29% higher than the 2011 Census estimate for the number of households in England and Wales.
This is an improvement on our previous unadjusted estimate of 22 million occupied addresses, which was 5.9% lower than the 2011 Census estimate for the number of households in England and Wales.
More detailed research needs to be undertaken to understand potential sources of bias in the DSE approach and to support development of a Population Coverage Survey (PCS) to adjust for coverage biases on administrative data for households.
This publication includes our first attempt at using administrative data to produce distributions of how households are made up for England and Wales and by local authorities; this initial research has produced encouraging results.
A framework for evaluating the quality of household composition estimates is yet to be developed; however, distributions for the majority of categories are comparable with 2011 Census estimates.
Following the publication of our Research Outputs on occupied address (household) estimates by size: 2011, we also include household size estimates for 2016, with comparisons with estimates from the Annual Population Survey (APS) at national level.

3. Things you need to know about this release

In this release, we use the term “households” when referring to the estimates that have been produced for these Research Outputs. These estimates are actually based on the concept of “occupied addresses” from administrative data, which is different from traditional “household” definitions used in censuses and surveys. The main aim of these outputs is to highlight what can currently be achieved using administrative data to meet the traditional definition of “household”.

For this release, we’ve attempted to identify and remove communal establishments by using address classifications on Ordnance Survey’s AddressBase¹ product. This is more consistent with the definitions used in household statistics produced from the census and other social surveys. It may also explain some of the changes observed in the estimates published in this release compared with those published last year.

The administrative data Research Outputs presented in this release have been developed from the same population base used in our previous release – Statistical Population Dataset (SPD) Version (V)2.0. The SPD V2.0 is produced by anonymously linking records at person and address level from various administrative data sources. These include the NHS Patient Register (PR), the Department for Work and Pensions (DWP) Customer Information System (CIS), data from the Higher Education Statistics Agency (HESA) and England and Wales school census data.

Notes for Things you need to know about this release

AddressBase – an Ordnance Survey address product compiled from local authority, Ordnance Survey and Royal Mail address lists.

4. Background

Detailed statistics about households are currently produced from the census every 10 years. In between census years, official household statistics are produced at national level using survey data that have an insufficient sample size to produce reliable small area household statistics. One of the advantages of a future Administrative Data Census is the potential to produce detailed household statistics on a regular basis for small geographies.

In our previous release, Occupied address (household) estimates from administrative data: 2011 and 2015, we outlined a number of challenges associated with the use of administrative records in producing household statistics. These are summarised as follows.

Definitions of households

Address information in administrative data is generally collected from individuals registering for services, whereas surveys and censuses are designed to collect targeted information about households. This makes it challenging to meet the traditional definition of “household” using administrative data.

Address matching

As we’re using addresses as the basis for identifying and grouping individuals into households, our method depends on linking address information on administrative records to Ordnance Survey’s AddressBase product. While this enables us to standardise address information and use Unique Property Reference Numbers (UPRNs) to produce household statistics, approximately 4% of administrative records can’t be matched to a UPRN. Reasons for these non-matches are referred to in our previous release. Mostly, it’s due to the absence or insufficient quality of address information held in administrative sources.
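The grouping step described here can be illustrated with a minimal sketch. It assumes person records have already been linked to AddressBase, so that each record carries either a UPRN or nothing; the record layout, function name and identifiers are hypothetical and are not taken from the SPD.

```python
from collections import defaultdict

def group_by_uprn(records):
    """Group (person_id, uprn) records into occupied addresses.

    Records with no UPRN (uprn is None) cannot be assigned to an
    address and are set aside, mirroring the unmatched records
    described in the text.
    """
    households = defaultdict(list)
    unmatched = []
    for person_id, uprn in records:
        if uprn is None:
            unmatched.append(person_id)
        else:
            households[uprn].append(person_id)
    return dict(households), unmatched

# Hypothetical toy records: (person identifier, UPRN or None)
records = [("p1", 1001), ("p2", 1001), ("p3", 1002), ("p4", None)]
households, unmatched = group_by_uprn(records)

occupied_count = len(households)  # number of occupied addresses
sizes = {uprn: len(members) for uprn, members in households.items()}
```

In this toy case, the person without a UPRN drops out of both the occupied address count and the household sizes, which is one way unmatched records lead to underestimation.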

Complex addresses

Some addresses require a UPRN hierarchy to identify where there are multiple dwellings within the same property. For example, purpose-built flats often have a “parent UPRN” for the entire building and “child UPRNs” for each flat within the building. In some situations, such as some student halls of residence, this hierarchy isn’t available and only the “parent UPRN” is identifiable, so the whole building is treated as a single occupied address. This means we might count one UPRN where we should be counting, say, 30 child UPRNs.

Population exclusions

Currently we only include occupied addresses in our household statistics when there is evidence that they’re likely to be occupied by persons in the “usually resident”¹ population. More information about our inclusion rules in the construction of Statistical Population Dataset (SPD) Version 2.0 is available in our Research Outputs on estimating population size using administrative data. In some instances, we may incorrectly determine that persons aren’t part of the usually resident population and consequently exclude the address they’re occupying from our household statistics.

The issues described in this section list only some of the challenges associated with using administrative records for household statistics. However, each of these was identified as a contributing factor in underestimating the number of occupied addresses when compared with the 2011 Census estimates.

One way of improving on these estimates is to make use of a coverage survey, similar to the methodology used in the 2001 and 2011 Censuses. Our unadjusted estimates for occupied addresses from administrative records in 2011 were 5.9% lower than the coverage-adjusted census estimate for the number of households. In this release, we outline how we’ve adapted the dual system estimation (DSE) census methodology to produce coverage-adjusted estimates for the number of occupied addresses. We also compare results with the 2011 Census estimates for numbers of households.

In addition to our research on coverage-adjusted outputs for 2011, we’ve also produced outputs for the number of occupied addresses for 2016. As suitable survey data aren’t currently available for the DSE approach described in the following sections, these estimates remain unadjusted but allow us to continue a time series of outputs for 2016. To supplement the time series for 2016, we are also releasing estimates of household size for 2016 using the same methodology reported in our previous publication, Occupied address (household) estimates by size, 2011.

This release also includes our first attempt at producing distributions of household composition. We present a methodology based predominantly on a rules-based approach for assigning household composition from information available on administrative records. This method is supported with an imputation approach for more complex households. We present analysis and outputs of household composition for both 2011 and 2016.

Notes for Background

Usually resident population – we are currently adopting the UN definition of "usually resident" – that is, the place at which a person has lived continuously for at least 12 months, not including temporary absences for holidays or work assignments, or intends to live for at least 12 months (United Nations, 2008).

5. Household statistics for 2011

Using DSE to estimate number of occupied addresses

The basic capture-recapture approach of dual system estimation (DSE) is to count a sample of the population once and then count a second sample (most often in the form of a follow-up survey). DSE is traditionally used to account for under-coverage in the form of non-response and was used in the 2001 and 2011 Census coverage-adjustment process.

The statistical assumptions of DSE are:

there is perfect matching between addresses on the Statistical Population Dataset (SPD) and survey using a high-quality unique identifier
the address frame used is complete and of high quality
there are no erroneous records on the SPD or survey (see Coverage-adjusted administrative data population estimates for England and Wales, 2011 for explanation of “erroneous”)
the SPD and survey, in this case, the Population Coverage Survey (PCS), are independent

As described in Occupied address (household) estimates from administrative data: 2011 and 2015, we’ve developed an automated methodology that links address records to AddressBase to assign Unique Property Reference Numbers (UPRNs). Only records that have been assigned to a UPRN via a successful match to AddressBase are included in the estimation and therefore we can assume a high quality of matching between addresses and the SPD.

AddressBase contains a list of residential addresses compiled from local authority, Ordnance Survey and Royal Mail address information. Information about addresses is routinely updated on AddressBase and provides a comprehensive list of residential addresses in England and Wales. We intend to include an address check as part of our PCS test in summer 2018. This is to test our assumption that the coverage of addresses on AddressBase is accurate for estimating numbers of occupied addresses.

Erroneous records and their prevalence in administrative data were a particular focus in our recent Research Output on Coverage-adjusted population estimates, England and Wales, 2011, which describes in detail the issue of administrative data “over-coverage”. When using DSE to estimate the number of occupied addresses, over-coverage is only a problem if the addresses people have moved out of remain unoccupied. There are instances of vacant addresses appearing as occupied in administrative sources. However, this type of over-coverage will be less common than the over-coverage affecting the size of the population, where the individuals need to be registered in the right place. Over-coverage of this latter kind may affect the size of households, but it has little impact on the number of households.

The fourth assumption of DSE, which requires that the two sources used are collected independently, is arguably less of an issue when combining a coverage survey with administrative data in a DSE framework. This is because administrative data are collected independently by government departments and therefore non-registration is less likely to be indicative of non-response to coverage surveys. This assumption needs to be tested and we’re currently identifying suitable datasets and a framework for measuring dependence between survey response and administrative data registration.

Our Statistical Population Dataset (SPD Version (V)2.0), which is constructed by linking multiple administrative datasets, is used as the basis for the occupied address estimates we’ve produced in this Research Output. As mentioned in Section 3, we’ve used classifications on AddressBase to identify and remove communal establishments from SPD V2.0. Only records with UPRNs are included in the coverage-adjustment process.

Figure 1: Overview of the coverage adjustment methodology

Linking administrative data and other sources adjusted for coverage errors produce population and household estimates

Source: Office for National Statistics

For coverage adjustment, we’ve sampled a Population Coverage Survey (PCS) using 2011 Census data, in a way that replicates the sample data that would typically be collected from a PCS in future (see Figure 2). This approach isn’t perfect and will be refined in future.

Our current method relies on creating a unique list of all Lower Layer Super Output Areas (LSOAs) in England and Wales, which we use as our Primary Sampling Unit (PSU) frame. We then randomly select 4% of these PSUs in every local authority and create a unique list of the Output Areas (OAs) within these LSOAs to form our Secondary Sampling Unit (SSU) frame. Our final sample is drawn by taking 25% of these SSUs to give us a 1% sample of OAs in England and Wales.
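The two-stage draw described above can be sketched as follows. This is an illustration only: the lookup structure, the function name, the rounding rule and the floor of one unit per stage are assumptions, as is applying the 25% second-stage rate within each local authority.

```python
import random

def draw_pcs_sample(oa_lookup, psu_rate=0.04, ssu_rate=0.25, seed=0):
    """Draw a two-stage sample of Output Areas (OAs).

    oa_lookup maps local authority -> {LSOA code: [OA codes]}.
    Stage 1 selects a share of LSOAs (PSUs) in every local authority;
    stage 2 selects a share of the OAs (SSUs) inside the chosen LSOAs.
    """
    rng = random.Random(seed)
    sample = []
    for la, lsoas in oa_lookup.items():
        psu_frame = sorted(lsoas)
        # Assumption: take at least one PSU so every authority is covered.
        n_psu = max(1, round(psu_rate * len(psu_frame)))
        chosen_psus = rng.sample(psu_frame, n_psu)
        ssu_frame = sorted(oa for lsoa in chosen_psus for oa in lsoas[lsoa])
        n_ssu = max(1, round(ssu_rate * len(ssu_frame)))
        sample.extend(rng.sample(ssu_frame, n_ssu))
    return sample

# Hypothetical geography: 2 authorities, 100 LSOAs each, 4 OAs per LSOA.
toy = {
    la: {f"{la}-L{i}": [f"{la}-L{i}-OA{j}" for j in range(4)] for i in range(100)}
    for la in ("LA1", "LA2")
}
sample = draw_pcs_sample(toy)
sampling_fraction = len(sample) / 800
```

With these toy sizes, 4% of LSOAs followed by 25% of their OAs yields 4 OAs per authority, which is 1% of all OAs, matching the overall rate quoted in the text (0.04 × 0.25 = 0.01).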

Figure 2: Overview of Population Coverage Survey sampling

A random 1% sample of output areas is drawn from census data and used as the PCS in this method.

Source: Office for National Statistics

This approach ensures that the sample is spread across all local authorities in England and Wales and that the overall sample size is similar to what we expect to achieve in a future PCS. Figure 1 shows an overview of the methodology we use to produce our household estimates by local authority. For a more detailed description of how we have calculated DSE for occupied addresses at local authority level, please see Annex 1.

The method we’ve used is based on aggregating counts of occupied addresses across OAs that have been selected for the PCS within each local authority. By observing the number of occupied addresses that are counted in both the PCS and the SPD in the sampled areas, a dual system estimate is derived. This adjusts for occupied addresses that are missing from the SPD.
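As a rough illustration of the capture-recapture idea, the sketch below uses the classic Lincoln-Petersen form of a dual system estimate. This is the textbook estimator, not necessarily the exact calculation set out in Annex 1, and the counts are invented for the example.

```python
def dual_system_estimate(n_spd, n_pcs, n_both):
    """Lincoln-Petersen dual system estimate of the true total.

    n_spd  : occupied addresses counted on the SPD in the sampled areas
    n_pcs  : occupied addresses counted by the PCS in the same areas
    n_both : occupied addresses counted by both sources
    """
    if n_both <= 0:
        raise ValueError("at least one address must appear in both sources")
    return n_spd * n_pcs / n_both

# Hypothetical counts for the sampled OAs of one local authority.
estimate = dual_system_estimate(n_spd=940, n_pcs=950, n_both=900)
missed_by_spd = estimate - 940  # addresses the SPD is estimated to miss
```

The smaller the overlap between the two sources relative to their individual counts, the larger the upward adjustment.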

Within the sampled areas, the adjusted DSE total is compared with the unadjusted SPD total to derive a “ratio weight”. This is combined with the sample weight for each OA, which provides the basis for producing estimates for the number of occupied addresses in each local authority. See Annex 1 for a detailed description of this method.
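One possible reading of this weighting step is sketched below. How the ratio weight and the sample weights are actually combined is described in Annex 1; the multiplicative form, the function name and all figures here are assumptions made for illustration.

```python
def weighted_la_estimate(sampled_oas, dse_total):
    """Scale SPD counts in sampled OAs up to a local authority estimate.

    sampled_oas : list of (spd_count, sample_weight) pairs, one per sampled OA
    dse_total   : dual system estimate for the sampled OAs combined
    """
    spd_total = sum(count for count, _ in sampled_oas)
    # Ratio weight: how far the SPD falls short of (or exceeds) the DSE total.
    ratio_weight = dse_total / spd_total
    # Assumed combination: each OA's SPD count is scaled by both weights.
    estimate = sum(weight * ratio_weight * count for count, weight in sampled_oas)
    return ratio_weight, estimate

# Hypothetical values: two sampled OAs, each representing 100 OAs.
ratio_weight, la_estimate = weighted_la_estimate(
    [(120, 100.0), (130, 100.0)], dse_total=262.5
)
```

Here the SPD counts 250 occupied addresses in the sampled OAs against a DSE total of 262.5, so every SPD count is inflated by 5% before being weighted up.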

To evaluate the quality of these estimates, a measure of variability is required. Since we are using 2011 Census data, repeated samples of data can be drawn using the process described previously to generate a PCS covering approximately 1% of England and Wales in each sample. In this study, we’ve repeated this process 100 times to obtain a distribution of estimates for each local authority on which to evaluate DSE performance.

Analysis of coverage-adjusted estimates

We use the following performance measures to assess the DSE for estimating the number of occupied addresses. These are:

relative bias (RB) – the percentage difference from the true population values; a small RB means that the estimate is close to our true population (official census household estimates)
relative standard error (RSE) – the mean variability of the estimates over the 100 simulations
relative root mean squared error (RRMSE) – a measure of the accuracy of the estimates, taking into account both RB and RSE; a lower RRMSE value means a more accurate estimate
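The three measures can be computed from a set of simulated estimates along the following lines. This is a sketch under stated assumptions: all three measures are expressed as a percentage of the true value, the population standard deviation is used for RSE, and the simulated estimates are invented.

```python
import math
from statistics import mean, pstdev

def performance_measures(estimates, truth):
    """Return (RB, RSE, RRMSE) in percent for repeated estimates of one total."""
    rb = 100 * (mean(estimates) - truth) / truth
    rse = 100 * pstdev(estimates) / truth
    rrmse = 100 * math.sqrt(mean((e - truth) ** 2 for e in estimates)) / truth
    return rb, rse, rrmse

# Hypothetical estimates from four simulation runs; true value 100.
rb, rse, rrmse = performance_measures([101, 99, 103, 102], truth=100)
```

Under these definitions the measures are linked by RRMSE² = RB² + RSE², which is why RRMSE reflects both systematic error and variability.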

Performance at national level

The results from applying the dual system estimation (DSE) approach outlined previously are presented in Table 1. These results show that our coverage-adjusted estimate for the number of occupied addresses in England and Wales is approximately 68,000 (0.29%) higher than the 2011 Census estimate for the number of households. This is a vast improvement on the difference observed in our previous Research Output, where the unadjusted estimate for the number of occupied addresses was 1.4 million (5.9%) lower than the 2011 Census household estimates.

Table 1: National estimate and difference from census
	Occupied address estimate	Difference from 2011 Census
SPD V2.0 Unadjusted (Research Output, February 2017)	21,980,124	-1,385,920
SPD V2.0 Coverage adjusted	23,433,814	67,770
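The percentages quoted in the text follow from Table 1. The short check below derives the implied 2011 Census household estimate from each row (estimate minus difference) and recomputes the percentage differences; the variable names are illustrative.

```python
unadjusted = 21_980_124
adjusted = 23_433_814
diff_unadjusted = -1_385_920
diff_adjusted = 67_770

# Both rows should imply the same census household estimate.
census_from_unadjusted = unadjusted - diff_unadjusted
census_from_adjusted = adjusted - diff_adjusted

pct_unadjusted = 100 * diff_unadjusted / census_from_unadjusted
pct_adjusted = 100 * diff_adjusted / census_from_adjusted
```

Both rows imply a census estimate of 23,366,044 households, and the percentage differences round to -5.9% and 0.29%.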

Table 2 presents the results of the performance measures for the coverage-adjusted estimate at national level. The low relative bias of 0.29% is also accompanied with low measures of RSE and RRMSE of 0.26% and 0.39% respectively. However, we expect measures of variance to be lower at national level than at local authority level.

Table 2: Performance measures of national estimate (with coverage adjustment)
	Relative Bias (%)	RSE (%)	RRMSE (%)
Coverage adjusted estimate	0.29	0.26	0.39
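As a consistency check on Table 2, and assuming that RB and RSE are both expressed relative to the census value so that squared bias and variance add, the reported RRMSE can be recovered from the other two measures:

```latex
\mathrm{RRMSE} \approx \sqrt{\mathrm{RB}^2 + \mathrm{RSE}^2}
  = \sqrt{0.29^2 + 0.26^2}
  = \sqrt{0.0841 + 0.0676}
  = \sqrt{0.1517}
  \approx 0.39
```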

Performance at local authority level

At local authority level, the quality of our estimates varies. Figure 3 shows the distribution of local authorities by percentage difference from the 2011 Census household estimates. Two distributions are plotted: the darker (blue) bars show the percentage differences for the unadjusted estimates of occupied addresses and the lighter (yellow) bars show those for the DSE-adjusted estimates.

When the DSE adjustment is applied, a larger proportion of the 348 local authority estimates are closer to the 2011 Census household estimates. There are also far fewer local authorities in the lower tail of the distribution with occupied address estimates that are more than 10% lower than the 2011 Census household estimates.

Figure 3: Local authority distribution with difference from census

England and Wales, 2011

Source: Office for National Statistics

We don’t currently have any quality standards for estimating the number of occupied addresses. These will be developed in future. However, for now we can use the quality standards we’re currently using to evaluate the quality of population estimates.

Table 3: Difference between adjusted and unadjusted estimates for the number of occupied addresses against quality standards
	Unadjusted estimate for number of occupied addresses		DSE adjusted estimate for number of occupied addresses
Quality standard	Cumulative number of local authorities	Cumulative percentage (%)	Cumulative number of local authorities	Cumulative percentage (%)
P1: within ±3.8%	109	31.3	336	96.6
P3: within ±8.5%	314	90.2	346	99.4

Using the P1 and P3 quality standards, Table 3 shows that after the DSE adjustment has been applied, approximately 97% of local authorities fall within plus or minus 3.8% difference from census estimates and 99% within plus or minus 8.5%. This leaves only two local authorities with RB values outside 8.5%. These are City of London and Gwynedd.
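The banding against the quality standards can be sketched as below. Treating the limits as inclusive is an assumption, and the relative bias values are a subset of those reported in Table 4; the cumulative shares use the counts from Table 3.

```python
P1_LIMIT = 3.8  # P1: within plus or minus 3.8%
P3_LIMIT = 8.5  # P3: within plus or minus 8.5%

def quality_band(relative_bias):
    """Return the tightest quality standard met by a relative bias (%)."""
    size = abs(relative_bias)
    if size <= P1_LIMIT:
        return "P1"
    if size <= P3_LIMIT:
        return "P3"
    return "outside P3"

# Relative bias values as reported in Table 4 (after coverage adjustment).
table_4 = {"City of London": -23.7, "Gwynedd": -8.8, "Hastings": -6.0, "Southwark": 5.5}
bands = {la: quality_band(rb) for la, rb in table_4.items()}

# Cumulative shares from Table 3: 336 and 346 of 348 local authorities.
share_p1 = round(100 * 336 / 348, 1)
share_p3 = round(100 * 346 / 348, 1)
```

City of London and Gwynedd fall outside P3, while Hastings and Southwark meet P3 but not P1, consistent with the two exceptions named in the text.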

When compared with our previous Research Output results, we see a vast improvement in meeting the quality standards, specifically P1: an additional 227 local authorities (65%) now meet the P1 quality standard and fall within plus or minus 3.8% difference from census estimates. Because of this, when we map the adjusted results and compare with our previous results (see Figure 4), we use scales at a finer granularity than we did in our previous Research Output.

Figure 4: Relative bias before and after the dual system estimation (DSE) coverage adjustment, England and Wales, local authority areas

The coverage adjustment reduces the relative bias in all local authorities in England and Wales

Source: Office for National Statistics

Notes:

Relative bias is the percentage difference between the Statistical Population Dataset and the 2011 Census.
Dual System Estimation is the basic capture-recapture statistical approach to estimate the size of a population.
These maps are not to be interpreted as standalone. Please see Section 5 in the Research Output: an update on developing household statistics for an Administrative Data Census, for more information.

The map in Figure 4 with unadjusted estimates shows that we underestimated the number of households using occupied addresses by more than 4% for the majority of local authorities. For adjusted estimates, almost half of local authorities are within 1% difference from the 2011 Census household estimates (coloured in the lightest shade, grey). Some local authorities have slightly higher estimates, which may be the result of over-coverage as described earlier in this section.

The most notable regional change after the application of DSE is in London. As Figure 4 shows, every local authority in London (apart from one) was undercounting census by more than 4% prior to the coverage adjustment. After the DSE adjustment, we now see a mix of over- and under-estimation in London.

Table 4 displays the 10 local authorities with the largest percentage difference from 2011 Census household estimates using the DSE adjustment. Six of the ten local authorities show a negative bias (underestimating compared with census) and four show a positive bias (overestimating compared with census). We can’t observe all possible sources of error in the DSE adjustment. More detailed simulation studies are needed to understand how different sources of bias impact on the DSE framework for producing estimates of occupied addresses.

Table 4: 10 local authorities with the largest absolute relative bias (after coverage adjustment)
Rank	Local authority	Relative bias (%)
1	City of London	-23.7
2	Gwynedd	-8.8
3	Hastings	-6.0
4	Southwark	5.5
5	Lambeth	5.2
6	Rutland	5.0
7	Liverpool	-4.7
8	Isles of Scilly	-4.5
9	Colchester	4.2
10	Shepway	-4.1

Figure 5 shows the 10 local authorities that had the largest percentage difference from 2011 Census household estimates in our February 2017 Research Output and the impact of our coverage adjustment. In all 10 local authorities, our estimates are closer to census after the DSE adjustment, although the magnitude of improvement varies by local authority.

Previously, all 10 of the local authorities were failing to meet the P3 quality standard (within plus or minus 8.5%). After DSE is applied, seven of these now meet the P1 quality standard (within plus or minus 3.8%) and one meets the P3 quality standard (see Figure 5). The largest improvement is seen in Kensington and Chelsea with a move of 32 percentage points towards the official census estimate.

As mentioned previously, City of London and Gwynedd still don’t meet the P3 quality standard and both local authorities still undercount census. However, the estimates for these two local authorities considerably improve after the coverage adjustment, with improvements of 27 and 13 percentage points respectively.

Figure 5: 10 previous worst performing local authorities and the impact of the coverage adjustment

The coverage adjustment improves the previous estimates in all ten local authorities.

Source: Office for National Statistics

More detailed simulations are needed to understand potential sources of bias in these outputs. Occupied address estimates for the majority of local authorities are reasonably close to 2011 Census estimates. However, it’s possible that the effects of positive and negative sources of bias are cancelling each other out to produce estimates that are comparable with census. Potential sources of bias are listed in the following points and can’t be directly measured in this study:

matching error in the link between address on PCS and SPD to AddressBase – some addresses on the PCS and SPD can’t be successfully linked to AddressBase to assign UPRN; these addresses are subsequently excluded from the DSE, which is a negative source of bias
missing addresses from AddressBase – until we undertake an address check (intended for summer 2018), we have some uncertainty about the coverage of AddressBase; incomplete coverage would be a negative source of bias
violating the independence assumption – we haven’t yet tested whether there is a correlation between non-registration on administrative sources and non-response to surveys; if non-registration is predictive of lower survey response, this would also be a negative source of bias
over-coverage on administrative sources – as described earlier in this section, vacant addresses that appear occupied in administrative records, because residents have moved out and not updated their address records, would introduce a positive source of bias

When designing the future PCS, we’ll make use of detailed simulation studies to understand the impact of sources of bias on a DSE framework for producing household estimates from administrative data.

Summary of DSE research

This research is our first attempt at combining survey data with administrative data to produce coverage-adjusted estimates for the number of occupied addresses. In our previous publication, Occupied address (household) estimates from administrative data: 2011 and 2015, we discussed the potential need to replace the traditional census definition of “household” with an alternative that can be more accurately met with administrative data.

We’ve started to engage with users of household statistics in more detail to understand the impact that a change in definition would have. Discussions at our recent Royal Statistical Society (RSS) workshop¹ on households and addresses identified that census household definitions have frequently changed to reflect how people live together. We also discussed the potential of moving towards approaches that have been successful in other register-based countries.

An alternative definition is likely to be needed, particularly when producing statistics about household composition and families. However, our research indicates that estimating the number of households might be achievable with an occupied address definition. This is further supported by our exploratory analysis of 2011 Census data, which indicates that the majority of UPRNs (99.6%) were occupied by a single household. However, there is still a need to estimate for houses in multiple occupation (HMOs); to do this we will consider how to make use of information that is collected about HMOs at local level.

We plan to use the findings of this research to inform the development of our PCS. In addition, we plan to use these coverage-adjusted estimates of the number of occupied addresses to adapt the use of structure preserving estimator (SPREE) methods for producing household characteristic outputs. For example, we plan to use them to support the SPREE model to produce local authority estimates of household size.

Household composition

Background

While a PCS has considerable potential to improve estimates for the number of occupied addresses, producing statistics about the characteristics of households occupying addresses is more challenging with administrative data. Censuses and surveys can be designed to collect information about household members and their relationships, whereas administrative records tend to collect limited information about relationships between people living at the same address. Further complexities are also introduced when information is not up-to-date for all household members. Delays in registering a change of address, for example, can result in individuals being missed in household analysis, or people being incorrectly included in the same household.

More detail about the type of information we have available from administrative records can be found in our Household Composition SlideShare. In summary, we’ve combined relationship information from benefit claims with other information from across administrative sources, including pseudonymised surname comparisons and age-structures within addresses. We use this information to assign a 2011 Census household composition category to each UPRN on SPD V2.0.

Challenges associated with household composition from administrative data

In Section 4, we summarised several issues with using administrative data to produce household estimates. We’ve previously focused on how these may affect estimates of the numbers of households, but they may also affect the apparent composition of households. Even if households are correctly counted, the people assigned to a particular UPRN may not correctly represent a real household, in particular due to the following reasons.