1. Main points
Census 2021, as with any census, was subject to non-response and incorrect responses.
We at the Office for National Statistics (ONS) built on our successful 2011 approach to estimating coverage error and on the improved processing of Census 2021.
We used logistic regression models rather than a stratified approach to dual system estimation, allowing us to account for the effect of many more characteristics on response.
The results then went through a thorough quality assurance process, with adjustments made where necessary; the census count was 97% of the final estimate.
2. Summary
The Census of England and Wales provides an accurate, comprehensive and consistent picture of the England and Wales population, as laid out in Design for Census 2021. The key aim of the Census is to produce high quality population counts, at subnational and national level, for demographic characteristics, which include:
age
sex
ethnicity
tenure
accommodation type
economic activity
Despite the best efforts of census data collection operations to count everyone, the complexity and size of the population result in census coverage errors. The most prevalent coverage error is when a member of the target population is missed in the census (undercoverage). Less often, but still at a non-ignorable rate, a member of the target population is either duplicated or counted somewhere other than their usual residence (overcoverage). The ONS uses a variety of statistical methods to estimate these coverage errors and produce corrected population totals for local authorities by the key demographic characteristics. These estimates are in general more accurate than the raw census counts.
The Census Coverage Survey (CCS) is used to estimate the census coverage error. The data from this survey are linked to the census data, and a combination of capture-recapture, analysis of complex survey data, and small area estimation methods is then applied. Finally, a bias adjustment process adjusts for issues that cannot be accounted for in the main design. These can occur when some of the statistical assumptions are not practically attainable in complex data collection exercises like the CCS and census. Variance estimation methods are used to assess the uncertainty around these estimates.
Census data collection covers both the general household population and managed residential accommodation, known as communal establishments. The general population comprises both households and the individuals living in those households. This methodology focuses on Census 2021 coverage estimation for the general population; details of coverage estimation for communal establishments will be published in early January 2023.
4. Census coverage estimation methodology
General population undercoverage estimation and adjustment
Once the coverage survey and census data are ready for estimation and matching is complete, we know, within sampled areas, whether a census response exists for each CCS record. This can be used to model the coverage probability given a set of observed census variables (or predictors) and their combinations (interactions).
Logistic regression is a powerful and well-understood tool for modelling probabilities. If modelling is done appropriately, using a large sample makes it possible to increase the precision of estimates. We can also reduce certain errors by simultaneously controlling for several important variables and some of their interactions. The approach uses the entire dataset to relate a combination of demographic and other characteristics to the estimated probability that a member of the population with such characteristics will respond to the census. Take, for example, a person who has the following characteristics:
aged 32 years
male
white
living in a rented purpose built flat
looking for a job
in a household size of two residents
related to somebody in the household
in hard-to-count 3
in South West of England
in a self-contained accommodation
born in the UK
not a student
in an area that received access codes
observed census return rate 0.965
This member of the population would have an estimated census response probability of 0.95, while a person with exactly the same characteristics except being female would have an estimated probability of 0.953.
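As a rough illustration of how a logistic regression turns a set of characteristics into a response probability, the sketch below sums coefficients on the log-odds scale and applies the inverse logit. The coefficient names and values are hypothetical and chosen only for illustration; they are not the fitted Census 2021 model, which used many more variables and interactions.

```python
import math


def inverse_logit(eta: float) -> float:
    """Map a linear predictor on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients on the log-odds scale (NOT the fitted ONS model).
COEFFICIENTS = {
    "sex_female": 0.07,
    "tenure_rented": -0.30,
    "born_in_uk": 0.40,
}
INTERCEPT = 2.0  # hypothetical baseline log-odds of responding


def response_probability(record: dict) -> float:
    """Estimated probability of a census response for one set of 0/1 indicators."""
    eta = INTERCEPT
    for name, value in record.items():
        eta += COEFFICIENTS[name] * value
    return inverse_logit(eta)


# Two records that differ only in sex, mirroring the structure of the example.
male = {"sex_female": 0, "tenure_rented": 1, "born_in_uk": 1}
female = {**male, "sex_female": 1}

p_male = response_probability(male)
p_female = response_probability(female)
print(f"male: {p_male:.3f}, female: {p_female:.3f}")
```

Because the hypothetical coefficient for being female is positive, the second record receives a slightly higher probability than the first, in the same direction as the 0.95 and 0.953 in the example.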
The model predicts the probability for each combination of variables. Therefore, each observed census record gets a corresponding census response probability. This probability can be transformed to a coverage weight by taking its reciprocal. In the example above, we have weights 0.95⁻¹ ≈ 1.053 and 0.953⁻¹ ≈ 1.049, respectively. If we observe 1000 individuals in the census data like the first person in the example, we can sum the corresponding weights to estimate the population total for individuals with such characteristics: 0.95⁻¹ × 1000 ≈ 1053. Similarly, if we observe 1000 individuals in the census data like the second person, the corresponding estimate is 1049. Of course, we are never interested in such a specific set of characteristics, but rather in something more useful, say, age-sex group by local authority. Since all census records are weighted, it is possible to produce the undercoverage adjusted total for any group of interest.
Mixed effects logistic regression was used for both household and person undercoverage estimation. It is similar to the model described above, but has the local authority as a "random effect". This random effect allows the model to reflect area-specific variability more efficiently than having local authority as a fixed effect (like all the variables in the example shown). Without the random effect, the probabilities for these two persons would be 0.95 and 0.953, no matter which local authority within the region they were located in. With random effects, however, those probabilities would be, say, 0.962 and 0.964 in local authority A, but 0.934 and 0.938 in local authority B. Mixed effects based estimation reflects local differences, but comes at the cost of increased variability of estimates.
Undercoverage was estimated and corrected for both the household totals and person totals. The general approach was the same in both cases, though the actual models are different. In the case of household estimation, there was an additional adjustment for the distribution of household size. Model selection for the two populations was run independently, but we tried to keep the levels of variables and the interactions used as consistent as possible. The two populations are 'reconciled' by the adjustment process; further information on the adjustment process will be published in winter 2022.
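The weighting arithmetic in this paragraph can be checked with a short sketch. The function names are illustrative and not taken from any ONS code; the probabilities are the ones quoted in the example.

```python
def coverage_weight(response_probability: float) -> float:
    """Coverage weight: the reciprocal of the estimated response probability."""
    if not 0.0 < response_probability <= 1.0:
        raise ValueError("response probability must be in (0, 1]")
    return 1.0 / response_probability


def estimated_total(response_probabilities) -> float:
    """Undercoverage adjusted total: the sum of weights over observed records."""
    return sum(coverage_weight(p) for p in response_probabilities)


# 1000 observed census records like each of the two example persons.
first_group = [0.95] * 1000
second_group = [0.953] * 1000

print(round(coverage_weight(0.95), 3))     # 1.053
print(round(coverage_weight(0.953), 3))    # 1.049
print(round(estimated_total(first_group)))   # 1053
print(round(estimated_total(second_group)))  # 1049
```

Because every record carries its own weight, the same summation works for any grouping of records, such as an age-sex group within a local authority.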
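One way to read the illustrative figures above is as a random intercept: each local authority shifts the log-odds of response by the same amount for everyone living there. The sketch below assumes that simple structure (an assumption for illustration; the fitted model is not reproduced here), backs out the shift implied for the first person in local authorities A and B, and applies it to the second person.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta: float) -> float:
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-eta))


# Fixed-effects-only probabilities from the example.
p_first, p_second = 0.95, 0.953

# Local authority shifts on the log-odds scale implied by the first person's
# illustrative probabilities (0.962 in A, 0.934 in B).
shift_a = logit(0.962) - logit(p_first)  # positive: response above the regional level
shift_b = logit(0.934) - logit(p_first)  # negative: response below the regional level

# Applying the same shifts to the second person.
p_second_a = inverse_logit(logit(p_second) + shift_a)
p_second_b = inverse_logit(logit(p_second) + shift_b)
print(round(p_second_a, 3), round(p_second_b, 3))  # 0.964 0.938
```

Under this assumption, the second person's probabilities come out at the 0.964 and 0.938 quoted in the text, which shows how one area-level term can move every record in a local authority together.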
Figure 1a: Age-sex undercoverage probabilities (female)
England and Wales
Source: Office for National Statistics, Census 2021
Figure 1b: Age-sex undercoverage probabilities (male)
England and Wales
Source: Office for National Statistics, Census 2021
General population overcoverage estimation and adjustment
An approach similar to undercoverage estimation was used for overcoverage estimation at person level. Overcoverage occurs when a member of the census population is enumerated:
more than once
in the wrong location
despite not being a member of the target population (e.g. individuals born after census day)
because of a completely fictitious census return
Where possible, data cleaning resolved erroneous records (Remove False Persons) and multiple responses at the same location (Remove Multiple Responses). As such, in overcoverage estimation we only estimate for individuals enumerated more than once or enumerated in the wrong location.
Instead of modelling the coverage probability of those in the census, overcoverage estimation was used to estimate the probability of correct enumeration in the census. Much like undercoverage estimation, the linked census and CCS data allowed each linked record to have an outcome of 0 or 1, depending on whether it was correctly enumerated or not. It is important here to assume that there is no overcoverage in the CCS, as it is used as the correct location of census individuals. This assumption rests on the way the census and the CCS are designed: the gap between the two collections is intended to be long enough to optimise response rates but short enough to limit movement in the population. This linked outcome was used to model the probability of correct enumeration, using a fixed effects logistic regression model. Both numerical issues in the model fitting process and timescales meant that random effects were not included in this model.
Using the same example of characteristics, that person might have an estimated census correct enumeration probability of 0.995. A person with exactly the same characteristics except being female has an estimated correct enumeration probability of 0.9953.
In the same way, the overcoverage model produces a correct enumeration probability for each combination of variables. Therefore, each observed census record gets a corresponding census response probability and a correct enumeration probability. The response probability can be transformed to the coverage weight by taking its reciprocal. However, overcoverage estimation aims to down-weight the census estimate, so the undercoverage weights are multiplied by the correct enumeration probabilities. When undercoverage is estimated and overcoverage is corrected for, the two example persons have weights 0.995 × 0.95⁻¹ ≈ 1.047 and 0.9953 × 0.953⁻¹ ≈ 1.044, respectively. If we observe 1000 individuals in the census data like the first person, we can sum the corresponding weights to estimate the population total for individuals with such characteristics: 0.995 × 0.95⁻¹ × 1000 ≈ 1047. Similarly, if we observe 1000 individuals in the census data like the second person, the corresponding estimate is 1044.
As in 2011, matching the census dataset to itself allowed for stronger estimates of the level of duplication, with high precision for each of 17 pre-specified groups within each region across England and Wales. This method is outlined in Census to census matching strategy 2021. The estimated proportions of duplication found within each group in each region were then used to calibrate the estimated probabilities of correct enumeration calculated by the model, to produce the final correct enumeration probabilities for each census record. Further information is available in The Proposed Duplication Calibration Method for the 2021 Census of England and Wales.
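The combined weight described here is the correct enumeration probability divided by the response probability. A minimal sketch, using the illustrative probabilities from the example and function names that are assumptions made for this illustration:

```python
def combined_weight(response_probability: float,
                    correct_enumeration_probability: float) -> float:
    """Weight correcting for both undercoverage and overcoverage.

    The undercoverage weight (1 / response probability) is scaled down by the
    probability that the record was correctly enumerated.
    """
    if not 0.0 < response_probability <= 1.0:
        raise ValueError("response probability must be in (0, 1]")
    if not 0.0 <= correct_enumeration_probability <= 1.0:
        raise ValueError("correct enumeration probability must be in [0, 1]")
    return correct_enumeration_probability / response_probability


# Illustrative probabilities for the two example persons.
weight_first = combined_weight(0.95, 0.995)
weight_second = combined_weight(0.953, 0.9953)

print(round(weight_first, 3))          # 1.047
print(round(weight_second, 3))         # 1.044
print(round(1000 * weight_first, 2))   # 1047.37
```

A record that is certainly correctly enumerated (probability 1) keeps its undercoverage weight unchanged, so the overcoverage correction can only reduce a weight.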
The estimated level of overcoverage in the 2021 Census of England and Wales was 0.96%, compared to the 2011 Census of England and Wales where the estimated level of overcoverage was 0.6%.
Figure 2a: Age-sex correct enumeration probabilities (female)
England and Wales
Source: Office for National Statistics, Census 2021
Figure 2b: Age-sex correct enumeration probabilities (male)
England and Wales
Source: Office for National Statistics, Census 2021
5. Bias adjustments
Based on the experience of previous censuses, some of the assumptions needed to produce coverage adjusted population estimates with ignorable levels of bias may not be met in practice. Therefore, the development of methods to adjust for certain biases is also a part of coverage estimation. In addition, some ad-hoc adjustments may be implemented based on the quality assurance results and the availability of data.
Producing coverage error corrected population size estimates using the Census Coverage Survey (CCS) and census data requires independence between these two data sources. Independence means that, for every member of the target population, the chance of responding to the coverage survey does not depend on whether that member is a census respondent or non-respondent. In practice, such independence is not achievable and a dependence bias adjustment may be needed. In general, non-responders to the census may be less likely to respond to the coverage survey, which would bias the estimates downwards - leading to estimates that are too low. This is the type of bias for which a correction was planned and prepared in advance.
To do this, an alternative estimate was needed. As in previous censuses, an Alternative Household Estimate was calculated (see Alternative Household Estimate 2021). However, coverage estimation in the 2001 and 2011 Censuses used dual system, ratio, and synthetic approaches, whereas coverage estimation in Census 2021 used the mixed-effects logistic regression approach. The way the adjustment was applied was therefore very different this time, as outlined in Adjusting for the dependence bias in the Census 2021 coverage estimation.
There are two main challenges when using the Alternative Household Estimate to correct for the dependence bias. First, the alternative estimates are available only at a fairly high level of aggregation, defined by local authority by hard-to-count index by accommodation type. Second, reliable alternative estimates are available for the household population only, whereas a dependence bias adjustment is required for both the household and person populations.
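The direction of the dependence bias can be seen with the classical dual system (capture-recapture) estimator. It is used here only because it makes the effect easy to see; Census 2021 itself used logistic regression models. The population size and capture probabilities below are made-up values, and expected counts are used instead of simulated data so that the arithmetic is exact.

```python
def dual_system_estimate(census_count: float, survey_count: float,
                         matched_count: float) -> float:
    """Classical dual system estimate of the population size."""
    return census_count * survey_count / matched_count


def expected_estimate(population: float, p_census: float,
                      p_survey_if_in_census: float,
                      p_survey_if_missed: float) -> float:
    """Dual system estimate computed from expected counts."""
    census = population * p_census
    matched = census * p_survey_if_in_census
    survey_only = population * (1.0 - p_census) * p_survey_if_missed
    return dual_system_estimate(census, matched + survey_only, matched)


POPULATION = 100_000  # hypothetical true population size

# Independence: survey response does not depend on census response.
independent = expected_estimate(POPULATION, 0.95, 0.90, 0.90)

# Dependence: people missed by the census are also less likely to be
# captured by the coverage survey.
dependent = expected_estimate(POPULATION, 0.95, 0.90, 0.70)

print(round(independent), round(dependent))
```

With independence the estimate recovers the true population, while under this kind of dependence it falls short, which is the downward bias the adjustment is designed to correct.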
There were several dependence bias adjustment methods designed and tested at the research stage for the 2021 Census. The approach chosen was the direct adjustment method with reweighting (apportionment) based on the initial undercoverage probabilities.
In 2011, all local authorities were adjusted for dependence bias. However, in 2021 only five were adjusted.
Another adjustment made was a single year of age adjustment for those aged 0 to 3 years. Based on quality assurance, it was decided to adjust for these age groups across all local authorities in England and Wales using administrative data. In addition, those aged 4 to 15 years were adjusted in Wales and the North East based on the School Census.
6. Other quality assurance adjustments
Several other adjustments were made. There were 15 local authorities in undercoverage estimation where the random effect was forced to be 0 and only the fixed effects part of the model was used. This was because the Census Coverage Survey in those areas was not of sufficient quality to reliably support the mixed effects logistic approach, while switching to fixed effects logistic regression for the entire country would have had a negative effect for many other local authorities.
The estimated person coverage probabilities in several local authorities were constrained to the household coverage probabilities at the local authority by hard-to-count by accommodation type level. The reason was not directly related to estimation. In this case, the combination of person and household level estimates meant that the adjustment process might have experienced difficulties. After careful consideration and an assessment of the impact, the decision was made to constrain the probabilities.
7. Variance estimation
Variance estimation measures the variability of the estimates for the key domains of interest, such as person or household local authority totals, local authority by age-sex group totals, and local authority by tenure totals. This is outlined in Variance Estimation for 2021 Census Population Estimates. As in the 2011 Census, the bootstrap method was used. However, unlike in the previous census, the bias corrected percentile method was used to produce confidence intervals. This allowed the intervals to reflect the non-symmetric distribution of the coverage error corrected estimates.
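The sketch below shows one common form of a bias corrected percentile interval, assuming it corresponds to Efron's bias-corrected (BC) bootstrap interval; the details of the ONS implementation are in the paper cited above and are not reproduced here. The bootstrap replicates are synthetic values generated purely for illustration.

```python
import random
from statistics import NormalDist


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def bc_percentile_interval(point_estimate, bootstrap_estimates, level=0.95):
    """Bias-corrected percentile bootstrap interval (Efron's BC method)."""
    normal = NormalDist()
    replicates = sorted(bootstrap_estimates)
    n = len(replicates)
    # Share of replicates below the point estimate, kept away from 0 and 1
    # so that the inverse normal CDF is defined.
    share_below = sum(r < point_estimate for r in replicates) / n
    share_below = min(max(share_below, 1 / (n + 1)), n / (n + 1))
    z0 = normal.inv_cdf(share_below)  # bias-correction constant
    alpha = 1 - level
    q_low = normal.cdf(2 * z0 + normal.inv_cdf(alpha / 2))
    q_high = normal.cdf(2 * z0 + normal.inv_cdf(1 - alpha / 2))
    return quantile(replicates, q_low), quantile(replicates, q_high)


# Synthetic, right-skewed bootstrap replicates of a population total.
rng = random.Random(2021)
replicates = [50_000 * rng.lognormvariate(0.0, 0.01) for _ in range(2000)]
lower, upper = bc_percentile_interval(50_000, replicates)
print(f"95% interval: ({lower:.0f}, {upper:.0f})")
```

When exactly half of the replicates fall below the point estimate, the correction constant is zero and the interval reduces to the ordinary percentile interval; otherwise the quantiles used are shifted, which lets the interval be asymmetric around the estimate.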
9. Cite this methodology
Office for National Statistics (ONS), released 9 November 2022, ONS website, methodology article, Coverage estimation for Census 2021 in England and Wales