Cataloguing errors in administrative and alternative data sources – what, when and how

1. Overview

This catalogue acts as a theoretical overview: it introduces a way to organise errors in administrative data sources when used for statistical purposes. It is intended to be of practical use for anyone interested in the statistical quality of administrative data.

This catalogue demonstrates an organisation of errors and, within each error type, outlines some possible options for dealing with the errors when using administrative data for statistical purposes. The error types identified and included in the catalogue are measurement errors, representation errors and processing errors.

For each type of error, we provide a description of what the error is, an example of when it can occur, and how it can be measured and reported. We focus on administrative data sources, including analysis that combines multiple sources (administrative data, survey data and alternative sources). This is not an exhaustive list of all possible ways to deal with errors in administrative data, and we plan to update the document as research develops. We also welcome feedback.

2. Error in administrative data sources

Administrative data are collected during the operations of an organisation. Government produces a large amount of administrative data, providing a valuable resource if used correctly. There are legal gateways that can allow accredited and approved researchers to access administrative data for research and statistical purposes. Certain criteria must be met for this to happen, including the assurance that a person cannot be identified from the information disclosed for research and statistics.

Administrative data are generally not collected for the sole purpose of producing statistics. This can lead to challenges when using them for this reason, a summary of which can be found in the Journal of the Royal Statistical Society article, Statistical Challenges of Administrative and Transaction Data.

This catalogue organises errors in administrative data sources when used for statistical purposes. We include error types as they are currently used and known in the statistical field. Our catalogue has similarities to the Total Survey Error Framework of Groves and others (2004) and incorporates features of administrative data sources as discussed in the literature (Bakker, 2010; Zhang, 2012; Reid and others; Rocci and others; and KOMUSO, 2019). What is different in our publication is that we focus on error in single and multi-source administrative data, as a need for this has been identified from our work in the Office for National Statistics (ONS). The catalogue does not focus on longitudinal data, since we previously published an error framework for longitudinal administrative data sources.

This catalogue provides a brief overview of the different errors. For each error, there is an explanation of the error type and what the error is, and examples of how it can be measured, reported and resolved. This catalogue is not intended to provide a comprehensive literature review of all methods for measuring and dealing with errors, but it provides theoretical and practical examples that will hopefully lead the reader in the right direction when seeking to understand and resolve errors in administrative data sources.

3. Measurement errors

Measurement error is the difference between what you want to measure and what you collect or obtain. We can separate measurement errors into validity errors and reliability errors. We talk about measurement error and ways to measure it, in general, in this section. In the following sub-sections, we also include ways to measure and disentangle validity and reliability error.

Models using latent variables are often used to quantify measurement error. A benefit of these models is that they often do not rely on a gold-standard dataset for comparison. Structural equation models (SEM) can be used to assess measurement error in continuous data when there are multiple (at least two) data sources. These can be administrative data sources, survey sources or a combination of both. We previously explored this in our paper using structural equation modelling to quantify measurement error in different administrative sources for floor area.

Latent class models (LCMs) assess measurement error in categorical data with at least two sources. LCMs rely on certain assumptions. The first is internal homogeneity of the classes. The second is that, within each latent class, the observed variables are statistically independent (that is, the occurrence of one does not affect the probability of occurrence of the other). This assumption of independence can be relaxed a little because the latent variable explains why the observed items are related to one another. Often you can relax more assumptions by building more complex latent class models, depending on the software that you use.

Single‐trait multimethod (STMM) is a latent class modelling approach that can be used to estimate classification errors of the different categories within a variable (Oberski, 2017). Classification error is a type of measurement error and occurs when individuals are in a different category in the data than they should be. STMM also measures any mode effects, which are the errors or differences arising from how the data were collected when data have been collected using different modes (such as surveys or administrative sources).

While STMM models estimate classification error, they do not try to correct for classification error or conflicting values. To correct for classification error as well as estimate it, a method called multiple imputation with latent classes (MILC) can be used when there are multiple data sources. MILC also includes extra steps to account for some of the uncertainty in the modelling, using a combination of bootstrapping and multiple imputation. The Methodology and Quality Directorate at the Office for National Statistics (ONS) carried out initial stages of research using MILC on administrative data and published them in 2023 in the ONS Methodology working paper series. Boeschoten and others apply this model to the problem of classifying a homeowner versus a renter, using two surveys and a register from the Netherlands, as shown in their article published in the Journal of Official Statistics.

Another possible approach is the “generalized multi-trait multimethod” (GMTMM) model. This can be used to assess measurement error when there are different types of data and variables from different sources that make up what you want to measure. Examples of these different data are discrete values, non‐normal distributions and non‐linear associations. The GMTMM can be applied to any data type. As shown in their article in the Journal of the American Statistical Association, Oberski and others applied this model to income measurement in Germany, with linked survey and administrative data from the German Federal Employment Agency.

Validity errors

A validity error is a type of measurement error that captures the difference between what is measured and the true value. Typically, administrative data are collected for a specific purpose such as service provision, and their primary purpose is not use in official statistics. Validity error, in the administrative data sense, occurs when the collected data measure a different concept from the variable or data needed for statistical purposes or analysis.

How validity errors can be measured and reported

Single-source assessment

This starts with assessing the error in each single source (even if there are multiple data sources). You should assess differences between the data collected and what you aim to measure. This can be done through comparison with data collected from other administrative sources, surveys or alternative data sources. To check validity, the data collected in the current administrative source can also be compared with previous cuts from the same source. Another step in single-source assessment is to collect metadata and information from data suppliers during the registration and acquisition process, to provide evidence for how the data are collected and what the variables and values mean. We at the ONS are developing an administrative data conversation toolkit to help analysts have these conversations with data suppliers.

Multiple-source assessment

The unit-error theory (Zhang, 2011) can be used to measure the differences in statistical validity between multiple data sources of the same concept. Quality is compared across the different data sources by fitting a stratified multinomial model to the data. The root squared errors of prediction (RSEP) from this model are used to evaluate the differences between the sources. This is explained further under Unit error.

Reliability errors

Reliability error is a type of measurement error that captures how consistent results are when repeatedly collected under the same conditions. Small differences in such repeated observations suggest a small reliability error. If a measure is reliable (has low reliability error), any differences between repeated measurements can be attributed to changes in the underlying values and not to introduced errors.

How reliability errors can be measured and reported

Reliability errors can be measured and reported as a measure of variance, confidence interval or coefficient of variation. When using models, you can reduce reliability error by improving your statistical model. This can be achieved through either using stronger predictor variables or changing the model specifications to obtain a better fit to the data.
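As a minimal sketch of these summary measures, the Python below takes a small set of repeated measurements of the same quantity and reports the sample variance, the coefficient of variation and an approximate 95% confidence interval for the mean. The values are made up for illustration, and the normal-approximation multiplier of 1.96 is an assumption; with so few repeats, a t-distribution multiplier would give a wider interval.

```python
import math
import statistics

# Hypothetical repeated measurements of the same floor area (square metres).
repeats = [82.0, 84.0, 83.0, 85.0, 81.0]

mean = statistics.mean(repeats)
variance = statistics.variance(repeats)   # sample variance (n - 1 denominator)
std_dev = math.sqrt(variance)
coef_var = std_dev / mean                 # coefficient of variation

# Approximate 95% confidence interval for the mean (normal approximation).
std_error = std_dev / math.sqrt(len(repeats))
ci_low, ci_high = mean - 1.96 * std_error, mean + 1.96 * std_error

print(f"variance={variance:.2f}, CV={coef_var:.4f}, CI=({ci_low:.2f}, {ci_high:.2f})")
```

A smaller variance or coefficient of variation across repeats points to a smaller reliability error.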

Missing data error: unit and item missingness

Missing data error occurs when some units or variables are not observed or measured. This is different to coverage error (where there was no attempt to measure a unit). If there was an attempt to measure the unit and that failed, this is missing data error and sits within measurement error methods. Missing data error can refer to:

Unit missingness – where there are missing records or missing observations, for example, where there has been an attempt to measure people on administrative data, but they are still missing through failure to respond, or for other reasons
Item missingness – where there are missing variables or missing values, for example, when measuring housing statistics, certain variables such as building type may have missing values

How missing data error can be measured and reported

There are two main strategies for dealing with missing data error in general. To deal with unit missingness, you can use weighting, which requires some additional information to define the weighting groups or classes.

To deal with item missingness, you can use imputation methods that complete a dataset (see more about imputation error in Section 5). We at the ONS are currently exploring additional methods to deal with missingness in administrative data.

4. Representation errors

Representation errors describe to what extent the records in the dataset represent real life (for example people, events, businesses, and transactions). Ideally every object in real-life has a corresponding object or unit recorded in the dataset. In this section of the catalogue, we identify two subtypes of representation error: coverage error, which is the difference between the target population and the accessed set, and selection error, which is the difference between the accessible set and accessed set. So, selection error lies within coverage error. We also discuss methods for measuring and dealing with representation errors.

Coverage error

Coverage errors arise in several ways:

Undercoverage is when units that you aim to collect are not represented in your dataset. The datasets themselves may not cover the whole target set of observations; or linkage errors may mean we cannot identify some people or units that should be included.

Overcoverage happens when you have additional units in your data that should not be included. There may be duplication, for example, if a dataset is based on hospital records and a person registers at the hospital several times. Other common causes for overcoverage include individuals who have left the country and not deregistered and are therefore wrongly still included in the administrative data.

Coverage error can also result from definitional differences between the administrative data and the statistics needed. For example, for statistical purposes we may need the administrative data to include only individuals defined as permanent residents, that is, those who reside in England for 12 months or more, whereas GP administrative data can include records of individuals who are in the country for shorter periods of time. This can result in overcoverage relative to the definition we need when measuring the population.

Selection error

Selection error occurs when some events fail to be reported in the data, while some of the reported events may not be valid. An example is where we are producing statistics on car owners from an administrative dataset, but that dataset also includes members of the public who lease their cars.

How representation error in general, and coverage error and selection error can be measured and reported

You can reweight individuals in a dataset to resolve representation error. For instance, if men are over‐represented compared with women, the men will be down‐weighted and the women up‐weighted. These reweighting methods rely on additional variables, or a gold-standard dataset, being predictive of the variable that you want to reweight. Otherwise, they will not improve the representativeness of the sample.
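The reweighting idea can be sketched in a few lines of Python. The counts, population shares and group means below are hypothetical, and the population shares are assumed to come from a gold-standard source. Over-represented men receive a weight below 1 and under-represented women a weight above 1.

```python
# Hypothetical administrative dataset: 700 men and 300 women.
data_counts = {"men": 700, "women": 300}
# Assumed gold-standard population shares.
population_share = {"men": 0.49, "women": 0.51}
# Hypothetical group means of the variable of interest.
group_mean = {"men": 30.0, "women": 20.0}

n = sum(data_counts.values())
# Weight = share in the population / share in the dataset.
weights = {g: population_share[g] / (data_counts[g] / n) for g in data_counts}

unweighted = sum(data_counts[g] * group_mean[g] for g in data_counts) / n
weighted = (
    sum(weights[g] * data_counts[g] * group_mean[g] for g in data_counts)
    / sum(weights[g] * data_counts[g] for g in data_counts)
)
print(weights, unweighted, weighted)
```

The weighted estimate differs from the unweighted one only because the group means differ, which illustrates why the weighting variable needs to be predictive of the variable of interest.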

One way of quantifying representation error in administrative data is to compare the probabilities of a person in an administrative dataset being assigned to a certain classification, given a set of covariates (propensity scores). This method is known as R indicators, and the linked publication applies it to survey data. Once the propensity scores are estimated using population auxiliary information from a census or a representative survey, we can take their inverse and use them to weight the administrative data, which will correct for the lack of representativeness. We are currently testing these methods in the Office for National Statistics (ONS) on administrative data.

Dual System Estimation (DSE) is a technique used specifically to estimate undercoverage error when there are two data sources, as when Statistics New Zealand combined the Census and an administrative-based population dataset. DSE assumes that the probability of a unit being included in one source is independent of its being in the other source. The overlap between the two sources indicates how much undercoverage there is in the two sources. One approach is to use a log‐linear model to estimate the total number of missing units. This method works for two surveys (Census and coverage survey), and we have done research in ONS to measure the population using this method with administrative data and surveys together.
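As an illustrative sketch of the arithmetic behind DSE, the Python below uses the simplest two-source estimator (often called the Lincoln-Petersen estimator), which follows from the independence assumption. The function name and counts are hypothetical, and the sketch assumes error-free linkage between the sources and no overcoverage in either source.

```python
def dual_system_estimate(n_source_a: int, n_source_b: int, n_both: int) -> float:
    """Lincoln-Petersen estimate of the total population size."""
    if n_both <= 0:
        raise ValueError("The sources must share at least one linked unit.")
    return n_source_a * n_source_b / n_both

# Hypothetical counts: units in source A, in source B, and linked in both.
n_a, n_b, n_ab = 9000, 8000, 7500

total = dual_system_estimate(n_a, n_b, n_ab)
observed_units = n_a + n_b - n_ab        # units seen in at least one source
missed_by_both = total - observed_units  # estimated units missing from both sources
print(total, missed_by_both)
```

The smaller the overlap relative to the size of each source, the larger the estimated number of units missed by both.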

Multiple system estimation (MSE) can be used to estimate specifically undercoverage where there are more than two data sources, and is used in a similar way to DSE. Van der Heijden and others estimated the population of New Zealand using different combinations of survey and admin data sources (PDF, 372KB). There are various assumptions with DSE and MSE to adhere to and these methods do not measure overcoverage. In terms of application to administrative data, the ONS has carried out research to apply MSE methods to multiple administrative and survey data and we have concluded that more work and development of methods is needed for successful application.

There are also methods to address overcoverage in administrative data. Rule-based methods are sometimes used in the ONS to decide whether to remove records assumed to exist because of overcoverage, such as a rule not to include records representing a new registration after migration from abroad. The ONS has also used model-based methods to account for overcoverage, such as fractional counting. Fractional counting captures uncertainty related to alternative values being correct through weighting and counting contradictory values. A trimmed DSE can also remove erroneous records from the administrative data and then estimate the remaining level of undercoverage (Zhang and Dunne, 2017).

5. Processing errors

Processing errors can occur in administrative data. Administrative data sources may be prone to processing error, since the data may require substantial processing to enable use for other purposes. There are two types of processing error: model error and non-model error. Within this section, we also discuss linkage errors, which can arise when multiple data sources are linked.

Non‐model error

Non‐model processing errors can take place as data is being prepared for use. They include:

data entry errors – errors from incorrectly entered data
coding errors – errors made in the process of classifying data into simplified or aggregated categories
editing errors – errors made when editing data to attempt to correct for other, pre‐existing errors

How non-model error can be measured and reported

It is possible to attach safeguards during data entry to try to identify possible errors (for example, by flagging unlikely or inconsistent information). Quantitative sense checks are also helpful in identifying non-model error, for example identifying outliers such as a 180-year-old member of the public. Qualitative research with data suppliers can identify where and how this is likely to occur.
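A minimal sketch of such sense checks is shown below. The field names, the reference year and the plausibility threshold are all assumptions made for illustration; in practice they would be agreed with the data supplier.

```python
# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "age": 34, "birth_year": 1989},
    {"id": 2, "age": 180, "birth_year": 1843},
    {"id": 3, "age": 52, "birth_year": 2010},
]

REFERENCE_YEAR = 2023     # assumed reference year for the extract
MAX_PLAUSIBLE_AGE = 115   # assumed plausibility threshold

def flag_record(record: dict) -> list[str]:
    """Return the list of checks a record fails (empty if none)."""
    flags = []
    if not 0 <= record["age"] <= MAX_PLAUSIBLE_AGE:
        flags.append("implausible_age")
    # Age and year of birth should agree to within one year.
    implied_age = REFERENCE_YEAR - record["birth_year"]
    if abs(implied_age - record["age"]) > 1:
        flags.append("age_birth_year_mismatch")
    return flags

flagged = {r["id"]: flag_record(r) for r in records if flag_record(r)}
print(flagged)
```

Flagged records are candidates for review rather than confirmed errors; the second record is internally consistent but implausible, while the third is plausible but inconsistent.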

Non-model errors are typically checked by auditing the data processing stage itself, for example, by comparing raw data with entered data to identify data entry errors, or by quality assuring the syntax used for data editing. The UK Statistics Authority has a Quality Assurance of Administrative Data Toolkit (PDF, 243KB) to help identify these early-stage errors. The Office for National Statistics (ONS) has also developed the Administrative Data Quality Framework, which guides us through the quality assessment process. Another way is to compare the distribution of a variable before and after editing. If there are editing errors, this might disrupt the distribution.

Model error

Model errors are caused by analysts introducing error or bias in the data through how they specify and adjust models used to produce the data or outputs.

How model error can be measured and reported

Model error is assessed through techniques such as sensitivity analysis and variance estimation using bootstrapping. Bootstrapping creates several samples from a single dataset by resampling. Based on these samples, we can calculate the variance and get a measure of the error. Sensitivity analysis is a way to test the sensitivity of a model to its statistical assumptions. For example, we can recalculate outcomes based on different assumptions to find the impact of a certain variable.
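The bootstrapping step can be sketched as follows. This is a minimal illustration that assumes the statistic of interest is a simple mean and that the observations are independent; the data values, the seed and the number of resamples are hypothetical choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical observed values.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7, 12.4, 10.2]

def bootstrap_variance(values, n_resamples=2000):
    """Variance of the sample mean across bootstrap resamples."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(values, k=len(values))  # sample with replacement
        means.append(statistics.mean(resample))
    return statistics.variance(means)

boot_var = bootstrap_variance(sample)
print(f"bootstrap variance of the mean: {boot_var:.4f}")
```

For a model-based output, the same loop would refit the model on each resample and record the estimate of interest instead of the mean.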

Model-based imputation error

Imputation errors can arise through adjustments made to the data to account for item non‐response. Adjustments through imputations are based on models, and so constitute a type of model error. Missing values in a dataset may be imputed with plausible values through a variety of approaches. All imputation methods rest on assumptions about the model being used to generate the imputed values, thereby creating a potential for imputation error. For example, regression methods can be used to impute values for item‐missing responses. Errors could then result from the assumption that the relationship between the variables used to fit the model and the variable with item‐missing values is the same for all observations.

How model-based imputation error can be measured and reported

Machine learning approaches can be used to measure imputation error. In their article, Bak and others describe an approach that uses observations without any missing values to simulate patterns of missing values and estimate the errors that would result from imputation. These measurements of error are then applied to cases that do have missing values.
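The general idea of simulating missingness on complete cases can be sketched as below. This is not the method of Bak and others: simple mean imputation stands in for a machine learning imputer, values are assumed to be missing completely at random, and the data, missing rate and number of simulations are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical complete cases (no missing values) for one variable.
complete = [21.0, 25.0, 19.0, 30.0, 27.0, 22.0, 24.0, 28.0, 20.0, 26.0]

def simulated_imputation_error(values, missing_rate=0.3, n_sims=500):
    """Mean absolute error of mean imputation under simulated missingness."""
    n_missing = max(1, round(missing_rate * len(values)))
    errors = []
    for _ in range(n_sims):
        # Hide a random subset (assumes missing completely at random).
        missing_idx = set(random.sample(range(len(values)), n_missing))
        kept = [v for i, v in enumerate(values) if i not in missing_idx]
        imputed_value = statistics.mean(kept)
        # Compare the imputed value with the true, deliberately hidden values.
        errors.extend(abs(values[i] - imputed_value) for i in missing_idx)
    return statistics.mean(errors)

mae = simulated_imputation_error(complete)
print(f"estimated mean absolute imputation error: {mae:.2f}")
```

The resulting error estimate is only as credible as the simulated missingness pattern; if real missingness depends on the values themselves, the simulation should mimic that pattern.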

Unit error

Unit error occurs when new statistical units need to be created that do not originally exist in the data but are composed of the data records (Statistics New Zealand, 2016). These newly created variables can also be called derived variables. This issue can be prevalent for administrative datasets because administrative data are not collected for the statistical purposes they are being used for. For instance, to create household units from datasets containing information on dwellings and people, we must decide which dwellings should have a household created, and which people should be assigned to which household unit. Errors occur when, for example, people are incorrectly assigned to a household.

How unit error can be measured and reported

We can measure unit error using Zhang’s unit‐error theory (PDF, 135KB). An example of how to apply this theory could be the relation between newly created household units and individuals in a dataset being linked by address in a series of allocation matrices. We then have two versions of each allocation matrix – one capturing the true allocation and the other capturing what is derived from the administrative data source. To assess error, the joint distribution of this pair of matrices is then estimated using an audit sample. An audit sample is a way to check for evidence of quality without having to use the entire dataset or population and usually comes from a sub sample or secondary sample.

Probabilistic linkage error

When data linking is required, there is the potential for linkage errors to arise. Linkage error can occur in two ways, through false matches or missed matches. False matches occur when a record is incorrectly linked to another record, whereas missed matches occur when a record is not linked to another source, even though the same individual or unit of interest is present there.

How probabilistic linkage error can be measured and reported

There are four main approaches for measuring linkage error, as set out in a Guide to evaluating linkage quality and in this article outlining challenges in administrative data linkage by Harron and colleagues.

The first is to compare with a “gold-standard” dataset, which is assumed to be error free. However, a gold-standard dataset will not be available in most cases. Instead, samples of matched records, stratified by probabilistic score, may be reviewed to estimate the number of false matches. In addition, samples of unmatched record pairs, again stratified by probabilistic score, may be reviewed to estimate the number of missed matches.
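A minimal sketch of scaling up a stratified clerical review sample to an estimated number of false matches is given below. The score bands, counts and review outcomes are hypothetical, and the sketch assumes links were sampled at random within each band.

```python
# Hypothetical clerical review results by match-score band:
# (links in band, links sampled for review, sampled links judged false).
review = {
    "0.95-1.00": (50000, 200, 1),
    "0.90-0.95": (8000, 200, 10),
    "0.85-0.90": (2000, 200, 40),
}

estimated_false = 0.0
total_links = 0
for band, (n_links, n_sampled, n_false) in review.items():
    false_rate = n_false / n_sampled         # false-match rate in the sample
    estimated_false += n_links * false_rate  # scale up to the whole band
    total_links += n_links

overall_false_rate = estimated_false / total_links
print(f"estimated false matches: {estimated_false:.0f} ({overall_false_rate:.2%} of links)")
```

The same calculation applied to samples of unmatched record pairs gives an estimate of missed matches.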

The second approach is to assess the rate of implausible scenarios among linked data, for example re‐admission after death in hospital data in this study by Hagger‐Johnson and colleagues (found on the National Library of Medicine website).

Thirdly, variables that are common across the linked datasets may be compared in the linked and unlinked data, to provide indicative evidence around possible bias introduced through linkage error. While not all records are expected to perfectly match, differences in the distributions of variables can point to possible bias in the linked data. You can compare the distribution of characteristics in the population to the distribution in the linked datasets. For example, if you have a linked dataset where 60% of linked persons are aged over 18 years but the ONS has reported that 80% of the general population are aged over 18 years, then you are likely to have either missed a large number of over-18 links, or to have made a large number of false under-18 links. These missed or false links cause bias in the linked dataset. Where you have used a gold standard or clerical review to find missed links and incorrect link errors, you can look for biases in those errors by comparing the characteristics of correct decisions to erroneous decisions. This approach will provide much more accurate bias estimates than comparing linked and unlinked records, as the linked and unlinked approach does not distinguish between missed links and records that truthfully should not have linked.

Finally, sensitivity analysis may be performed. Sensitivity analysis assesses how robust results are to the underlying assumptions. You can test this by changing the criteria used to link your observations and seeing how that changes your estimates.
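As a simple illustration, the sketch below recomputes an estimate under three different match-score thresholds. The candidate pairs, scores and thresholds are hypothetical, and a single score cut-off is assumed to be the only linkage criterion being varied.

```python
# Hypothetical candidate pairs: (match score, value of the analysis variable).
pairs = [(0.99, 1), (0.97, 0), (0.93, 1), (0.91, 1), (0.88, 0), (0.86, 1), (0.82, 0)]

def estimate_at_threshold(threshold: float) -> float:
    """Proportion with value 1 among pairs accepted as links at this threshold."""
    linked = [value for score, value in pairs if score >= threshold]
    return sum(linked) / len(linked)

for threshold in (0.95, 0.90, 0.85):
    print(threshold, round(estimate_at_threshold(threshold), 3))
```

Large swings in the estimate across plausible thresholds suggest that the results are sensitive to linkage error.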

As discussed in the National Statistician’s Quality Review on data linkage (shown on the GOV.UK website), reporting the quality of a linked dataset is about much more than the match rate (the proportion of records that have been linked). Match rate alone gives no indication of the accuracy of the linkage. In addition, the quality metrics “precision” and “recall” should be estimated and reported, where precision is the number of true matches made divided by the total number of matches made, and recall is the number of true matches made divided by the total number of true matches that exist.

So, precision is a measure of the accuracy of the matches made, and recall is a measure of the proportion of true matches that have been made.
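These two metrics can be computed directly from counts of true, false and missed matches. The sketch below assumes the standard definitions (precision as true matches over all matches made, recall as true matches over all true matches that exist); the function name and counts are hypothetical.

```python
def precision_recall(true_matches: int, false_matches: int, missed_matches: int):
    """Return (precision, recall) for a set of linkage decisions."""
    precision = true_matches / (true_matches + false_matches)
    recall = true_matches / (true_matches + missed_matches)
    return precision, recall

# Hypothetical counts, for example estimated from a clerical review sample.
precision, recall = precision_recall(
    true_matches=9400, false_matches=100, missed_matches=600
)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

In practice the counts themselves are estimates, so it is worth reporting how they were obtained alongside the two metrics.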

Identification error

Identification errors are a form of linkage error that can occur when assigning individual records to a higher level of grouping (or aggregation). The error occurs where there are discrepancies in the grouping variable itself, or the level of the grouping variable, across datasets.

Two examples of how identification errors arise are provided by Statistics New Zealand in their Guide to reporting on administrative data quality. Firstly, suppose there are person‐level data linked by a common identifier across several datasets and the objective is to distinguish groups of people living at the same address. If the different datasets contain different addresses for the same person, an identification error may occur when trying to decide on a single address for each person.

A second example considers a survey of building activity, where the objective is to link individual consent approvals to the construction jobs to which they relate. In most cases, one “consent” corresponds to a single construction job, but there may be situations where one job requires multiple consents (for example, for separate stages of work). At the same time, a single consent may apply to more than one job. Errors in correctly attributing each job to the consent(s) that relate to it would constitute an identification error.

How identification error can be measured and reported

You can measure identification error with the help of bootstrapping, to simulate new distributions to compare with your data. We at the ONS created measures of uncertainty when we needed to distribute migrants to local authorities (LAs) for the mid-year population estimates. To obtain these, we used a mix of bootstrapping methods. This created simulated distributions of LA counts, where the variance from the original mean reflected the uncertainty.

6. Feedback

If you have any feedback, comments or would like to collaborate with the Methodological Research Hub, please contact Methods.Research@ons.gov.uk. This is a working paper and can be updated with new research.

7. Acknowledgements

We want to thank and acknowledge Professor Paul Smith and colleagues at Southampton University for their work with this error catalogue.

8. References

Bakker BFM (2012), ‘Estimating the validity of administrative variables’, Statistica Neerlandica, Volume 66, pages 8 to 17

Groves R, Fowler F, Couper M, Singer E and Tourangeau R (2004), ‘Survey Methodology’, New York: Wiley

Oberski DL (2017), ‘Estimating error rates in an administrative register and survey questions using a latent class model’, in Biemer PP, de Leeuw E, Eckman S, Edwards B, Kreuter F, Lyberg LE, Tucker NC and West BT (editors), ‘Total survey error in practice’, pages 339 to 358. https://doi.org/10.1002/9781119041702.ch16

Zhang L‐C and Dunne J (2017), ‘Trimmed dual system estimation’, in Bohning D, Bunge J and van der Heijden P (editors), ‘Capture-Recapture Methods for the Social and Medical Sciences’, Chapter 17, Chapman and Hall CRC, pages 239 to 259

Zhang L‐C (2012), ‘Topics of statistical theory for register‐based statistics and data integration’, Statistica Neerlandica, Volume 66, pages 41 to 63

Zhang L‐C (2011), ‘A unit‐error theory for register‐based household statistics’, Journal of Official Statistics, Volume 27, pages 415 to 432

9. Cite this methodology

Office for National Statistics (ONS), released 21 April 2023, ONS website, methodology, Cataloguing errors in administrative and alternative data sources – what, when and how
