Executive summary

As the recent review of the Approved Researcher scheme concluded that Accredited Researchers should only access legally-protected data in a secure environment, we made the decision in October 2016 to stop the distribution of Office for National Statistics (ONS) data under the terms of a Special Licence. In partnership with the UK Data Service, we reviewed options to ensure research needs are met following this decision, and we now report on the findings and conclusions from that review.

Comprehensive empirical and theoretical analyses of ONS Labour Force Survey (LFS) data made available under the terms of a Special Licence (SL) and End User Licence (EUL), and the Special Licence Living Costs and Food Survey (LCFS) data, were completed to determine whether additional variables could be made available under a EUL. These exercises confirmed that an expanded LFS dataset (EUL variables with local authority district level geography and detailed country of birth) and the Special Licence LCFS data were disclosive and personal information, as defined by the Statistics and Registration Service Act 2007, and require legal protection. The findings confirmed that the decision to stop the distribution of the Special Licence data was correct as it is the only way to meet legal obligations to protect the confidentiality of data subjects.

We explored the possibility of adding detailed geography and variables to ONS EUL datasets, particularly where researchers have told us these would be important. Following an assessment of the disclosure risk to data subjects for each of these datasets, we will implement the following changes:

  • an additional level of geography (police force area) will be included for future versions of the Crime Survey for England and Wales (CSEW) EUL data; we plan to add these geographies to 2015 to 2016 CSEW EUL data by December 2017

  • inclusion of month and year of interview and single year of age up to 80 years old (80 and over will be top coded) for future versions of the Wealth and Assets Survey EUL data; we will also consult with users on their preferences for level of geography (for example, including regional rather than national level detail, but without Output Area Classifications (OAC))

  • additional variables (number of cars and vans) will be included in future versions of the LCFS EUL datasets
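The top coding described for single year of age can be illustrated with a minimal sketch. The function name `top_code_age` and the sample ages are hypothetical and are not taken from any ONS processing system; the only detail drawn from the text is the cap at 80.

```python
def top_code_age(age: int, cap: int = 80) -> int:
    """Return the age unchanged below the cap; ages at or above the cap become the cap."""
    return min(age, cap)


# Invented ages for illustration only.
ages = [34, 79, 80, 81, 97]
top_coded = [top_code_age(a) for a in ages]
print(top_coded)
```

After top coding, everyone aged 80 and over shares a single value, so very old respondents no longer stand out on age alone.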

We will continue to monitor the content of ONS EUL datasets to ensure we protect the confidentiality of data subjects and maximise the utility of the data without increasing the risk of identification.

No future ONS Special Licence datasets will be produced for any purpose. Previous users of Special Licence versions can continue, given appropriate approval, to access the far more detailed data within the Secure Research Service (formerly the Virtual Microdata Laboratory) and the UK Data Service Secure Lab for their research.

Background

We grant access to de-identified data, for statistical research purposes, to a wide range of individuals and organisations whose needs cannot be met through analysis of published summary statistics. Such access can only be granted if appropriate controls are in place to ensure the confidentiality of data subjects.

One way in which ONS ensures this is by requiring that these detailed data are only used in a secure setting, such as the Secure Research Setting (SRS) or the Secure Lab. The ONS Approved Researcher scheme is the legal gateway used to grant access to potentially disclosive research data for analysts outside of government.

We have allowed Approved Researchers to use less-detailed data, with fewer restrictions, by downloading them from the UK Data Service under the terms of a Special Licence and analysing them on their own systems. The level of detail provided in these Special Licence data, though significantly less than for data in the SRS and Secure Lab, is sufficient that the data are still considered "personal information", as defined by the Statistics and Registration Service Act (SRSA 2007), and require a legal gateway for access. Unlike analysis carried out in the SRS or Secure Lab environments, analysis of downloaded Special Licence data is not subject to checks on the outputs produced, which help protect the confidentiality of data subjects.

Following the decision to stop access to ONS Special Licence data as a download, the data were made available in the SRS and Secure Lab. We permitted existing research projects using ONS Special Licence data to continue and conclude under the agreements already in place.

We also make research data, with significant restrictions on important variables (for example, age and geography), available for statistical research purposes under the terms of an End User Licence (EUL). These “safeguarded” data have a very low risk of identification and are not personal information, that is, they are not protected under the SRSA 2007. Researchers must agree to the conditions of use, but do not have to be accredited and can download a copy of the data from the UK Data Service.

The “intruder testing” approach adopted by the Office for National Statistics (ONS) is one supported by both the UK Anonymisation Network (UKAN) and the Information Commissioner's Office (ICO) in the ICO Anonymisation Code of Practice. The Code articulates that a useful test of whether anonymised data are likely to result in the re-identification of an individual is “one used by the Information Commissioner and the Tribunal that hears DPA (Data Protection Act) and FOIA (Freedom of Information Act) appeals – involves considering whether an ‘intruder’ would be able to achieve re-identification if motivated to attempt this.”

There is an assumption in the statement that there are motivated intruders, so there is not a valid defence that no-one will ever try to identify individuals. In meeting our obligations to data subjects, it is irrelevant whether or not we think such intruders exist. UKAN emphasise that the empirical intruder testing should not be used as a replacement for theoretical disclosure risk metrics, but is a tool to be used alongside those more traditional methods.

Review of ONS Special Licence data

In October 2016, we wrote to Approved Researchers explaining the rationale for the decision to stop downloads of our Special Licence data and that we would work with the UK Data Service to review the use of these data in order to:

  • better understand why researchers use Special Licence data (as opposed to those available under an End User Licence (EUL) or in the Secure Research Setting (SRS) or Secure Lab)

  • determine whether data currently made available under Special Licence could be made available under a EUL; basing the determination on whether the level of detail of particular variables would be suitable for access (that is, not potentially disclosive) under EUL

  • assess how this decision is likely to increase demand for access to data through the secure environments

  • understand the arrangements put in place by researchers to ensure disclosure control

  • communicate with researchers regarding decisions that are made, seeking information from them as necessary

Use of ONS Special Licence data

Table 1 records the use of our Special Licence and End User Licence (EUL) data by researchers for the period January 2011 to September 2016. Considerably greater use was made of EUL data (an average of 23,800 downloads per annum) than of Special Licence data (an average of 2,200 downloads per annum).

Table 1 shows that 79% of the ONS Special Licence downloads were for the Labour Force Survey (LFS) or Annual Population Survey (APS) data. These data also accounted for 74% of ONS EUL downloads.


The UK Data Service analysed almost 200 ONS Special Licence project applications for the period April 2013 to October 2016, to gather information on the reasons provided by researchers to justify their need for the data. For almost all the applications analysed, detailed geography was the main reason provided. The only exceptions to this were for data such as the General Lifestyle Survey, where no EUL dataset exists.

The other main reason given for requesting Special Licence data was the more detailed characteristics. The variables and reasons for use most frequently identified as important were: detailed Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC), use for linking and employment or economic activity for the LFS and APS data; crime modules for the Crime Survey for England and Wales (CSEW) data; and age or date of birth and income or finance for a number of surveys.

Feedback from Special Licence users was very helpful in further developing our understanding of why researchers use these data. We received and responded to several emails from researchers. A number of the respondents stated that the removal of Special Licence access was an inconvenience for them as the Secure Research Setting (SRS) and/or Secure Lab did not fully meet their needs. We are taking forward work to develop the capacity and connectivity of the SRS (including remote access) to improve access to ONS research data.

We are engaging with researchers to ensure their views are taken into account. Feedback on users’ needs was also provided by relevant ONS survey teams. This was very helpful in identifying the most important variables for researchers for specific datasets and we took account of this feedback when deciding which variables to consider for additional inclusion in the EUL data. Following feedback from the Institute for Fiscal Studies, we carried out an additional intruder testing exercise using the Living Costs and Food Survey data.

As at 6 October 2016, there were 142 active projects accessing ONS Special Licence data. These were carried out by 65 organisations, 49 of which had remote access to the Secure Lab. The majority of these projects have now been completed or the researchers are accessing the data in the SRS or Secure Lab. An assessment of the impact of stopping Special Licence access on the safe SRS and Secure Lab settings concluded that this would be low.

Determining whether data currently made available under Special Licence could be made available under an End User Licence

ONS and the UK Data Service carried out separate, but related, reviews of the data. Two different types of study were carried out:

  • a unique analysis assessment was carried out on an expanded EUL version of the April to June 2016 ONS Labour Force Survey (LFS) data

  • an intruder test exercise was carried out for both the LFS and the Special Licence Living Costs and Food Survey (LCFS) data

The aim of the intruder test was to replicate or create a scenario whereby actual intruders seek to either find an individual within the dataset, or to demonstrate that the dataset is unsafe. The LFS was selected on the basis of the extent of its use by researchers (see Table 1) whilst the LCFS was selected in response to feedback from the Institute for Fiscal Studies about its importance.

The purpose of the exercise was also to determine whether data currently made available under Special Licence could be made available under EUL, basing the determination on whether the level of detail of particular variables would be suitable for access (that is, not potentially disclosive) under EUL.

We used an expanded version of the EUL April to June 2016 quarter of the LFS – containing approximately 90,000 individuals and 40,000 households. This included additional detail on the geography (local authority district rather than region) and the full detail on country of birth (275 categories) as users had indicated that geography and these variables were the most valuable to them.

The ONS unique analysis assessment focused mainly on the:

  • additional breakdown of geography (at local authority level)

  • retention of the four-digit industry (615 categories) and occupation (353 categories) detail

  • detailed country of birth variable (275 categories)

Table 2 shows that just considering one variable at a region level of geography often gives rise to a few uniques within the sample data, and only a small number of additional variables are needed to establish widespread uniqueness in the data. For example:

  • industry classification combined with occupation classification equals counts of 1 in 1,210 cells

  • industry classification combined with occupation classification multiplied by occupation equals counts of 1 in 17,740 cells (20% of all respondents)

  • adding age to the combination of industry classification combined with occupation classification multiplied by occupation gives counts of 1 in 37,897 cells (42% of all respondents and 91% of populated cells are 1s)

This means that there is a risk that a user or intruder can identify an individual from a very small number of variables. We used occupation, industry and age as an example of this as these have greater detail than most other variables in the dataset. A correct claim would enable the remaining variables to be disclosed, for that one individual.
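The unique analysis described above amounts to cross-tabulating the sample on a chosen set of key variables and counting the cells that contain exactly one record. The sketch below is illustrative only: the function name, variable names, records and code values are all invented, and it is not the ONS implementation.

```python
from collections import Counter


def count_sample_uniques(records, key_vars):
    """Return (number of cells with a count of 1, number of populated cells)."""
    # Each distinct combination of key values is one cell of the cross-tabulation.
    cells = Counter(tuple(rec[v] for v in key_vars) for rec in records)
    uniques = sum(1 for n in cells.values() if n == 1)
    return uniques, len(cells)


# Invented records; "ind_*" and "occ_*" are placeholder codes, not real SIC or SOC values.
records = [
    {"region": "R1", "industry": "ind_A", "occupation": "occ_X", "age": 41},
    {"region": "R1", "industry": "ind_A", "occupation": "occ_Y", "age": 41},
    {"region": "R1", "industry": "ind_A", "occupation": "occ_Y", "age": 35},
    {"region": "R2", "industry": "ind_A", "occupation": "occ_X", "age": 52},
    {"region": "R2", "industry": "ind_B", "occupation": "occ_Z", "age": 60},
]

# Add one key variable at a time to see how quickly uniqueness spreads.
results = {}
for keys in (
    ("region", "industry"),
    ("region", "industry", "occupation"),
    ("region", "industry", "occupation", "age"),
):
    results[keys] = count_sample_uniques(records, keys)
    print(keys, results[keys])
```

In this toy sample, every record becomes unique once age is added to the key, which mirrors the pattern reported for the much larger LFS sample.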

The dataset is a sample of much less than 1% of the population, and inclusion in the dataset is not a fact in the public domain. Because of this, an intruder may not be certain of the validity of a disclosure claim. The sampling is therefore a barrier to a user or intruder making a claim, and considerable uncertainty is introduced both by it and by the high level of geography. Response knowledge is not in the public domain but might be available through private conversation, for example.


Many of those sample uniques will be unique only because the sample is small, but a significant number will not. Just a small number of variables, albeit the most detailed ones and among the most important to the LFS user community, give rise to uniqueness in a large percentage of the sample.

The LFS dataset already has some disclosure risk within the EUL product. If a user or intruder knew a person even reasonably well, they could establish near certainty in an identification based on industry, occupation, region and a few demographic variables. However, the lack of response knowledge is the main barrier to an intruder making a claim on the current LFS safeguarded dataset (and those from other social surveys). Since a user or intruder would need the private knowledge of inclusion in the dataset to make identification with any confidence, the current LFS EUL dataset is therefore not personal information under the Statistics and Registration Service Act (SRSA) 2007. This is a conclusion supported by The Anonymisation Decision-making Framework.

However, the risk of there being sufficient information in the public domain for an intruder to make a disclosure claim is likely to increase over time, though the rate of that increase is open to speculation. There is already a large amount of publicly available information on prominent individuals (for example, on Wikipedia pages) and so an existing raised risk of disclosure on them in safeguarded datasets. But we have to consider what is reasonable to withhold – in terms of the balance of risk and utility, it may not be reasonable to collapse a number of variables so that a small number of specific records can be included in any dataset.

Nevertheless, the risk of an identification claim is thought to be low. The EUL datasets are long running and there have been no claims made, though it is quite possible that users have identified individuals inadvertently by spontaneous recognition and not made that known to others.

The UK Data Service (UKDS) analysis considered the four quarters of the 2015 LFS datasets. Two scenarios were considered: spontaneous recognition (inadvertent re-identification of a data subject because they are familiar to the researcher, or have a rare combination of attributes that makes them stand out) and private database cross-match, whereby a researcher identifies a data subject through use of external data sources.

For each of the datasets and scenarios, the UKDS used the Data Intrusion Simulation (DIS) method (see Appendix C in The Anonymisation Decision-making Framework for more information) to estimate file-level disclosure risk. The underlying assumption of DIS is that an intruder has information about a particular person and will try to match it with a record in a research data file to find out additional information.
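As a rough illustration of the kind of quantity DIS produces, the sketch below estimates the probability that a unique match against the file is a correct match. It assumes the closed-form approximation Uπ / (Uπ + P(1 − π)), where U is the number of sample uniques on the chosen key, P is the number of records in sample pairs and π is the sampling fraction. Whether the UKDS used exactly this form is an assumption; the function name and the data are invented.

```python
from collections import Counter


def dis_match_probability(key_values, sampling_fraction):
    """Estimate Pr(correct match | unique match) from sample key counts.

    Assumed closed form: U*pi / (U*pi + P*(1 - pi)), where U is the number
    of sample uniques, P is the number of records in sample pairs and pi is
    the sampling fraction.
    """
    cell_counts = Counter(key_values)
    uniques = sum(1 for n in cell_counts.values() if n == 1)
    records_in_pairs = sum(2 for n in cell_counts.values() if n == 2)
    numerator = uniques * sampling_fraction
    denominator = numerator + records_in_pairs * (1 - sampling_fraction)
    # With no uniques and no pairs the estimate is undefined, so return None.
    return numerator / denominator if denominator else None


# Invented key values: two uniques ("a", "d") and two pairs ("b", "c").
sample_keys = ["a", "b", "b", "c", "c", "d", "e", "e", "e"]
risk = dis_match_probability(sample_keys, sampling_fraction=0.01)
print(risk)
```

The estimate rises with the sampling fraction and with the share of uniques relative to pairs, which is why more detailed keys such as lower geography push the file-level risk up.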

The main findings from the private database cross-match scenario were:

  • the disclosure risks for the four Special Licence LFS datasets were roughly 6%, compared with less than 2% for the EUL datasets; both figures fall below the acceptable thresholds stated in Table 3

  • the residence geography has the greatest impact on a file’s disclosure risk

  • about 4.5% of the records in the Special Licence microdata are unusual, compared with virtually none (fewer than 50) in the EUL microdata

  • it is possible to lower the residence and workplace geography in the EUL datasets to NUTS 3 and NUTS 2 respectively and still keep disclosure risk of the files and the proportion of unusual records in them within acceptable limits (providing the level of detail was reduced in some other important variables); the NUTS 2 and 3 geographies include unitary and county local authorities, as well as district councils (some grouped)


The UK Data Service finding that geography is the greatest risk to disclosure is consistent with the findings from ONS theoretical and empirical tests.

The UK Data Service spontaneous recognition scenario exercise concluded that approximately one-tenth of the records were unusual. Assuming the 1% limit recommended by Elliot and others (2016) for unusual records applies to EUL datasets, two digits would be the recommendation for Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC) codes. The UK Data Service recognised that this would be a considerable concern for users (four-digit SIC and SOC variables are in the existing EUL LFS data), and so suggested that releasing three-digit SIC and two-digit SOC codes in the EUL data would be feasible. This argument is supported by analysis of Secure Lab use of LFS data over the last couple of years, which highlights that researchers made far greater use of SIC codes than of SOC codes.
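The effect of releasing fewer digits of a hierarchical code can be sketched by truncating the codes and recomputing the share of records that are unique on the truncated value. This is a simplified stand-in: "unique on one variable" is used here in place of the DIS definition of an unusual record, the function name is hypothetical, and the four-digit codes are invented rather than real SIC or SOC values.

```python
from collections import Counter


def share_unique(codes, digits):
    """Share of records whose code, cut to `digits` characters, occurs only once."""
    truncated = [code[:digits] for code in codes]
    counts = Counter(truncated)
    return sum(1 for code in truncated if counts[code] == 1) / len(truncated)


# Invented four-digit codes for illustration only.
codes = ["1011", "1012", "1020", "1107", "4711", "4719", "4721", "8610"]

shares = {digits: share_unique(codes, digits) for digits in (4, 3, 2)}
for digits, share in shares.items():
    print(digits, share)
```

Each digit removed merges categories, so the share of unique records can only stay the same or fall; the trade-off is the loss of detail that users value.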

The UK Data Service concluded that other main LFS variables had little impact on disclosure risk.

Intruder testing exercises

For the two separate intruder testing exercises using the same expanded LFS EUL data and the 2014 Special Licence Living Costs and Food Survey (LCFS) data, ONS staff were recruited to become “friendly intruders” over the period of one week for each exercise. Representation was predominantly from the recently established ONS Data Science Campus, together with other analysts from a range of ONS teams including survey teams, Statistical Disclosure Control and Research Support and Data Access. UK Data Service staff also took part in the LCFS exercise. In total, there were approximately 20 intruders for each exercise.

The expanded LFS and LCFS Special Licence data were deposited (at separate times) in the SRS (and the Secure Lab for the LCFS intruder test) and created in a number of forms, readable into SAS, SPSS, R, Stata and Python. Each intruder had a specific personal folder that held the dataset, metadata and other survey documentation, and a form to be used for all identification claims. Intruders were asked to record their confidence in any claims made. They were given unrestricted internet access, so that other public sources could be used to match with any detail on the dataset. To access the SRS and their personal folder, each intruder had to sign a Data Access Agreement to ensure they protected the confidentiality of data subjects.

For the LFS intruder test exercise, six intruders submitted claims. In total, there were 31 claims, though four were for self-identification (of the household's four individuals) due to one intruder actually being an LFS respondent at that time. It was helpful to note that the self-identification was deemed “certain” by the intruder's assessment of confidence in the claim, emphasising that the level of detail in social survey microdata is extensive in terms of the number of variables (in contrast to previous intruder testing exercises on aggregated tables where not all claims of self-identification were correct).

Of the 27 other claims, none were for friends, family or acquaintances. Nine of these were correct claims (by four different intruders), as assessed by the LFS team. All the correct claims had some level of geography as an important part of the claim. Correct claims were more often for small- to medium-sized local authority districts, but there were also correct claims for one large city and even one region. In the last instance, it was clarified that the intrusion was possible without the additional inference of the local authority level information.

Those variables mentioned explicitly by intruders in at least three correct claims from the LFS data are listed in Table 4.


Five of the correct claims included a mention of one or more of country of birth, year of arrival and nationality. Other variables mentioned in successful claims were marital status, qualifications, income (used as a confirmatory indication), and type of organisation.

When an intruder was confident in a claim, they were likely to be correct. This correlation is particular to unpublished research data where there are a large number of variables pertaining to any individual.

ONS and UK Data Service also carried out a joint intruder testing exercise using the 2014 Special Licence LCFS dataset. The survey consists of a household and individual questionnaire and data from both the household and person files were used in the exercise. Each individual aged 16 and over in the household is asked to keep diary records of daily expenditure for two weeks. Approximately 5,000 households from the UK are included in the survey with fieldwork carried out over a 12-month period (that is, January to December 2014).

A total of 40 identification claims were made. Of this number, there were 13 claims (for four households) correctly identifying people (data subjects) in the data. Of these correct claims, eight related to the same four people in one household, so nine unique individuals were identified in total.

Intruders used variables (such as single year of age, government region, number of adults and/or children in the household, relationship to householder and sex) combined with publicly available data sources (primarily social media, for example, Facebook, LinkedIn, Twitter) to find out detailed characteristics of data subjects and make identifications. Single year of age (up to 80 years old) and sex were used together in all correct identifications. The period of survey completion was also used to make identifications. There was a strong correlation between the confidence of the claims and correct identification.

Those variables mentioned explicitly by intruders in at least two correct claims for the LCFS data are listed in Table 5.


Conclusions and recommendations

The review met its aims as follows:

  • detailed analysis of why and how researchers use ONS Special Licence data was completed and the review methodology and conclusions take account of this

  • comprehensive theoretical and empirical analyses of the data made available under the terms of a Special Licence and End User Licence (EUL) were completed to determine whether additional variables could be made available under a EUL; this confirmed that the Labour Force Survey (LFS) and Living Costs and Food Survey (LCFS) Special Licence data are disclosive and personal information, and require legal protection

  • although there has been no confirmed unlawful sharing or disclosure of Special Licence data, it is impossible to confirm the security of these data as we cannot fully monitor or control their use

  • one suspected breach of data access involving ONS Special Licence data was investigated in 2016 and 2017; this confirmed that a procedural breach occurred

  • an assessment of the impact of stopping Special Licence access on the safe Secure Research Setting (SRS) and Secure Lab settings concluded that this will be low

  • we have sought and taken account of the views of Approved Researchers on their use of the data and the impact on them; we will send researchers a link to the published report

The ONS and UK Data Service analyses and intruder testing support the argument that the expanded LFS dataset should not be available for safeguarded use as the local authority geography and more detailed country of birth variables significantly increased the risk of identification. The number of successful identification claims made using the expanded LFS and LCFS datasets in the intruder testing exercises confirms that these data are personal information under the Statistics and Registration Service Act (SRSA) 2007. It is clearly possible for an intruder to develop an appropriate strategy to identify individuals from the dataset alongside other publicly available information.

It should be noted that if an intruder had to use private knowledge to identify a respondent, the data could still be released under safeguarded use (but not public use). If a user or intruder had access to response knowledge, it would almost certainly be possible to identify individuals as evidenced by the four self-identifications in the intruder test.

The theoretical analysis carried out by the UK Data Service concluded that additional geography may be possible in the LFS EUL. However, this would be at the expense of Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC) code detail. As users have told us that four-digit SIC and SOC level data are paramount to the utility of the LFS for them, we do not wish to reduce the level of detail in these variables for the EUL data. Furthermore, the empirical intruder testing carried out by ONS highlighted the high risk of disclosure resulting from the use of local authority level geography.

On the basis of the ONS and UK Data Service analyses of the LFS data, we conclude that:

  • the currently approved EUL or safeguarded dataset for the LFS (that is, without the local authority district geography and detailed country of birth) has the maximum amount of detail possible to remain within the law

  • the inclusion of additional Special Licence variables in the LFS EUL data would make this personal information as defined by the SRSA 2007 and require that the data should only be accessed by Approved Researchers in a safe setting such as the SRS or Secure Lab, and not available for download

  • the review confirms that the decision to stop the distribution of the Special Licence data was correct as it is the only way to meet legal obligations to protect the data

  • the currently approved safeguarded or EUL datasets for the LFS and Annual Population Survey (APS) should continue to be made available to researchers

The following categories for nationality and country of birth are included in the 2017 LFS End User Licence data, as detailed in this section.

The “main” variable has the following categories in case users are only interested in a high-level categorisation:

  • UK

  • European Union (excluding UK)

  • Other Europe

  • Asia

  • Rest of the world

  • Total

The “sub” variable provides a more detailed breakdown, as follows:

  • UK

  • European Union EU15

  • European Union EU8

  • European Union EU2

  • European Union other

  • Other Europe

  • Middle East and Central Asia

  • East Asia

  • South Asia

  • South East Asia

  • North Africa

  • Sub-Saharan Africa

  • North America

  • Central and South America

  • Oceania

These changes were included in the 2017 EUL LFS datasets (available for download in November 2017).
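The relationship between the “sub” and “main” categories listed above can be sketched as a simple lookup. The grouping shown is an assumption inferred from the category names; it is not a published ONS specification, and the names `SUB_TO_MAIN` and `main_category` are hypothetical. The “Total” category is omitted because it is an aggregate rather than a group of countries.

```python
# Assumed grouping of "sub" categories into "main" categories (inferred from the names).
SUB_TO_MAIN = {
    "UK": "UK",
    "European Union EU15": "European Union (excluding UK)",
    "European Union EU8": "European Union (excluding UK)",
    "European Union EU2": "European Union (excluding UK)",
    "European Union other": "European Union (excluding UK)",
    "Other Europe": "Other Europe",
    "Middle East and Central Asia": "Asia",
    "East Asia": "Asia",
    "South Asia": "Asia",
    "South East Asia": "Asia",
    "North Africa": "Rest of the world",
    "Sub-Saharan Africa": "Rest of the world",
    "North America": "Rest of the world",
    "Central and South America": "Rest of the world",
    "Oceania": "Rest of the world",
}


def main_category(sub_category: str) -> str:
    """Look up the assumed high-level category for a detailed category."""
    return SUB_TO_MAIN[sub_category]


print(main_category("European Union EU8"))
print(sorted(set(SUB_TO_MAIN.values())))
```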

On the basis of our analysis of the LCFS data, we conclude that:

  • LCFS Special Licence data are personal information as defined by the SRSA 2007, and require legal protection; they should only be accessed in a secure setting such as the Secure Research Service or Secure Lab

  • the variables on the number of cars and vans will be included in the EUL data in future; additional detail on income and benefits was considered, but assessed to increase the risk of data subject identification beyond acceptable limits for safeguarded EUL data

  • we will not be making any changes to the currently approved geography level of region in the EUL data

Expansion of the urban or rural classification (URC) from two to eight categories was considered for future LCFS EUL data. However, the inclusion of geo-demographic variables such as Output Area Classification (OAC) gives rise to some unusual combinations between OAC and region, as would expanding the URC to eight categories. Therefore, the current two-category URC will remain for the LCFS EUL data.

Conclusions for other ONS Special Licence datasets

Following the review of ONS LFS and LCFS EUL and Special Licence datasets, we considered the lessons learned for other ONS data that were available through Special Licence and took account of the Government Statistical Service guidelines, Disclosure Control Guidance for Microdata produced from Social Surveys. We considered the variables in both the Special Licence and EUL versions to determine the likelihood that the additional detail would make the EUL version too disclosive to be classified as safeguarded data. We did not carry out intruder testing for these data. Our conclusions from this work are detailed in this section.

Annual Population Survey (APS)

No additional variables will be added to the currently approved EUL version.

Crime Survey for England and Wales (CSEW)

We conclude that the geography level can be lowered from region to police force area, of which there are 42, for the EUL data. It is planned to include these geographies in the 2015 to 2016 CSEW EUL data by December 2017. We considered the inclusion of other important variables such as ethnicity. An initial assessment of ethnic group indicates that any increase in detail over the currently approved version would increase the risk of identification of data subjects beyond the acceptable limits for EUL data.

Wealth and Assets Survey (WAS)

Month and year of interview could be added to the EUL dataset without materially increasing the disclosure risk. The geography issue is specific to WAS. There is considerable detail on financial variables (given the focus of the survey) and this is greater than would normally be allowed. However, this is offset by the lack of geography below England and Wales, and the lack of detail in some other “visible” variables such as ethnic group and religion. The reduction of the geography to region, given the current inclusion of the 76 category OAC, would increase the disclosure risk significantly, and there should be no additional information (apart from that already included) on any linking within or between waves on individuals and households. We will consult users to explore the option of replacing the national level geography (with OACs) with regional level geography (without OACs) in the WAS EUL.

ONS Opinions and Lifestyle Survey

No additional variable detail will be included in the End User Licence version.

General Household Survey (GHS), Integrated Household Survey and General Lifestyle Survey (GLF)

All these datasets have been superseded. We will not be making any revisions to the EUL datasets for these surveys. The most detailed data for all these surveys will continue to be available in the SRS and Secure Lab.

Life Opportunities Survey

No further data will be produced.

No future ONS Special Licence datasets will be produced for any purpose. Previous users of Special Licence versions can, given appropriate approval, access the more detailed data within the SRS and Secure Lab.

References

Elliot and others (2016), The Anonymisation Decision-Making Framework, First edition, Manchester, United Kingdom: UKAN

Government Statistical Service (2014), Disclosure Control Guidance for Microdata produced from Social Surveys