1. Introduction
The Office for National Statistics (ONS), along with a wide range of other government departments, releases data both as tables and as record-level data (microdata). These departments have an ethical and legal duty to ensure that any information released into the public domain has only a minimal likelihood of allowing identification of (or an attribute relating to) an individual, household or business.
Record-level data are rich in detail, and it is important that a balance between utility and confidentiality is maintained. In this article we consider intruder testing as carried out on microdata, with a small subsection on the differences that apply when tables are under discussion.
This article describes steps to follow that will ensure that intruder testing is a worthwhile part of the process, resulting in protected but useful microdata.
2. Releasing microdata
The steps to follow when creating microdata for release are shown in this section, with step 4 introducing intruder testing.
The level of detail will depend on the licence under which the data are published. Data released under an Open Licence with no access restrictions will be less detailed than data made available under licence.
1. From the tabulation of the selected variable combinations, look for low counts or other distinctive patterns, such as where non-zero counts are distributed among a small number of cells in rows or columns.
2. If there are unique or rare combinations in the data, apply disclosure control to the problematic variable(s); these could include the removal of variables or records, recoding or record swapping.
3. Tabulate the variable combinations again from the “protected” microdata; if there are no obvious disclosure issues go to step 4, otherwise repeat step 2 and apply more disclosure control (steps 1 to 3 are illustrated in the sketch after this list).
4. Carry out intruder testing; if there are many successful claims then steps 2 to 4 will need to be repeated.
5. Publish the data under the required licence.
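The tabulation and protection loop in steps 1 to 3 can be prototyped in a few lines of code. The following is a minimal sketch, assuming a pandas DataFrame read from a hypothetical file, an illustrative set of quasi-identifiers and an example threshold; none of these names or values are prescribed.

```python
# A minimal sketch of steps 1 to 3; the file name, quasi-identifiers and
# threshold are illustrative assumptions.
import pandas as pd

microdata = pd.read_csv("microdata.csv")                  # hypothetical input
key_vars = ["age_band", "sex", "region", "occupation"]    # assumed quasi-identifiers
THRESHOLD = 3                                             # example "low count" cut-off

# Step 1: tabulate the selected combination and look for low counts.
counts = microdata.groupby(key_vars).size().reset_index(name="count")
risky = counts[counts["count"] < THRESHOLD]
print(f"{len(risky)} combinations fall below the threshold of {THRESHOLD}")

# Step 2 (one option among several): drop records that fall in low-count cells.
cell_size = microdata.groupby(key_vars)[key_vars[0]].transform("size")
protected = microdata[cell_size >= THRESHOLD]

# Step 3: re-tabulate the "protected" microdata and check again.
counts_after = protected.groupby(key_vars).size()
print(int((counts_after < THRESHOLD).sum()), "low counts remain")
```

In practice, step 2 would weigh record removal against recoding or record swapping, since dropping records damages utility.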
This article discusses the options for intruder testing in more detail.
3. What is intruder testing?
This involves using individuals described as “friendly intruders” to see whether they are able to re-identify anyone in the dataset. These intruders should have some background knowledge of the data similar to that of a typical user. However, they do not need to be specialist hackers capable of employing advanced data exploration techniques. The aim of intruder testing is to replicate the approach of an opportunistic attacker, not a well-resourced professional.
Over the years there has been considerable discussion about the level of protection required by each dataset, although there is a lack of formal statistical measurements. An indication is given by the EU Data Protection Directive and the UK Data Protection Act (paragraph 26), which state that “account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person”.
This is rather vague, with the term “likely reasonably” open to interpretation. However, an earlier Office for National Statistics (ONS) Code of Practice stated that if it took “disproportionate time, effort and expertise” to identify a record or an attribute then the data were regarded as being protected to a sufficient level. ONS Legal Services suggested that this can be interpreted as a few hours spent by an individual at Research Officer level (that is, an entry-level Civil Service statistician or data analyst) with a standard office PC or laptop and software. This will suffice as a working definition for this article.
In the intruder testing exercise, the motives of these selected intruders would not be malicious. No findings on individual claims would be released into the public domain although all claims will be looked at to see if they are genuine. It should always be kept in mind by those running the session that the aim of the intruder test is to assist in the release of a dataset in which the risk of disclosure is minimised.
There are two major types of intruder testing although there is no requirement for both to be used for any one dataset.
Internal testing
A small number of volunteers with knowledge of the data (for example, those working in the government department that will publish the data, those working in another department that uses the data, or graduate students who may use the data in the future) and with internet access spend a period of time (around half a day is normal) trying to identify records in the data. Each identification claim is assigned a confidence percentage by the “intruder”.
External testing
An organisation with experience of intruder testing carries out a formal test to see what can be identified. Typically, they could be given a list of people or businesses that are present somewhere in the data and asked to identify the records that relate to them, effectively mimicking the presence of “response knowledge”. Their success will be measured in terms of the number of correct identifications. As with internal testing, each identification will have an associated confidence percentage.
One of the main purposes of such a test is to try to capture what other information may be linked to the dataset by the intruder to attempt disclosure. Thus the appropriate selection of intruders, in terms of their awareness of similar data sources and their ability to search and analyse the data, is important for accurate results. The information resulting from the intruder test may be used to refine the previous steps, particularly with regard to which variable combinations are considered to be quasi-identifiers and therefore used in the identification key.
The risk can be assessed by carrying out intruder testing, which provides practical, empirical evidence of the effectiveness of the protection employed.
Intruder testing can be an iterative process. The following steps refer to the steps in Section 2: Releasing microdata.
If the intruder testing results in a large number of correct identifications (alongside any incorrect identifications), return to step 2 to protect the data further, followed by steps 3 and 4. A particular variable with many categories may be encouraging intruders to identify individuals (some successfully). Example A: Occupation could be coded at a low level, detailing people’s jobs to a specific degree. Example B: Date of birth could be present, a high-risk variable of great use to an intruder.
If the intruder testing results in a small number of correct identifications (alongside any incorrect identifications), go to step 2 for minor additional protection, followed by steps 3 and 4. Example A: Occupation or industry (or both) may have to be recoded to a higher level. Example B: Month of birth could still be too disclosive.
If there are no correct identifications and only a small number of incorrect identifications (perhaps none), then statistical disclosure control may have been applied too stringently. Reduce the amount of statistical disclosure control applied in step 2 and then repeat steps 3 and 4. Example A: Occupation may be coded at too broad a level, leading to a dataset of low utility. Example B: Age may be grouped in 10-year age bands, possibly too coarse for most analyses.
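Examples A and B describe recoding detailed variables to coarser levels. A minimal sketch of that kind of recoding is shown below; the column names, the reference date and the band widths are illustrative assumptions, not part of any published specification.

```python
# A minimal sketch of the recoding described in Examples A and B; the column
# names ("date_of_birth", "occupation_4digit"), the reference date and the
# band width are illustrative assumptions.
import pandas as pd

microdata = pd.read_csv("microdata.csv")                  # hypothetical input

# Example B: replace date of birth with a 5-year age band.
dob = pd.to_datetime(microdata["date_of_birth"])
reference = pd.Timestamp("2021-03-21")                    # illustrative reference date
age_years = (reference - dob).dt.days // 365
microdata["age_band"] = (age_years // 5) * 5              # 0, 5, 10, ...
microdata = microdata.drop(columns=["date_of_birth"])

# Example A: coarsen a detailed occupation code to its 2-digit major group.
microdata["occupation_major"] = microdata["occupation_4digit"].astype(str).str[:2]
microdata = microdata.drop(columns=["occupation_4digit"])
```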
4. Why and when is it necessary?
The increasing demand from users for government departments to publish microdata alongside the more familiar tables has led to the requirement for these datasets to be assessed for disclosure risk before release. Both intruder testing and theoretical risk metrics can be used to examine the data.
The Information Commissioner’s Office Anonymisation Code of Practice (PDF download) recommends a “motivated intruder test” as a way that organisations can assess the empirical risk of re-identification from anonymised data. This test requires using someone who is “reasonably competent, has access to resources, such as the internet” and would employ “investigative techniques, such as making enquiries of people who may have additional knowledge of the identity of the data subject”. The motivated intruder is not assumed to have any specialist knowledge.
Intruder testing is not intended as a replacement for theoretical disclosure risk metrics, but as a tool to be used alongside those more traditional methods. These include analysis of uniques, where key variables are tabulated (usually in two, three or four dimensions) to see if any combinations are unique or rare in the data. These are usually protected prior to intruder testing.
Key variables are variables that an intruder would use as a starting point when trying to identify a member of the dataset. They are likely to be either visible, such as age band and sex, or sensitive, such as a health indicator.
Intruder testing generally does not involve a thorough risk assessment of a full dataset but gives a snapshot of risk specific to the intruders used and the parts of the data they are given. For this reason, it is intended only as a secondary tool for empirical analysis of the data. It is always advisable to carry out an analysis of uniqueness alongside intruder testing, to highlight obvious risks within the data that may nevertheless not be picked up by the intruders in an intruder testing exercise.
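The analysis of uniques described above can be automated across all low-dimensional combinations of key variables. The sketch below assumes a pandas DataFrame and a hypothetical list of key variables; it simply counts sample-unique combinations for every two-, three- and four-way cross-tabulation.

```python
# A minimal sketch of an analysis of uniques; the file name and key variables
# are hypothetical.
from itertools import combinations
import pandas as pd

microdata = pd.read_csv("protected_microdata.csv")
key_vars = ["age_band", "sex", "region", "marital_status", "occupation"]

for k in (2, 3, 4):
    for combo in combinations(key_vars, k):
        cell_sizes = microdata.groupby(list(combo)).size()
        n_unique = int((cell_sizes == 1).sum())
        if n_unique:
            print(f"{combo}: {n_unique} sample-unique combination(s)")
```

Sample uniqueness does not by itself imply population uniqueness, but it flags the combinations an intruder is most likely to target.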
5. Brief notes on earlier intruder testing exercises for open (public use) datasets
A number of datasets have been through the process described in section 2. Some of these are highlighted in this section with more detail available in the given links.
UK 2011 Census teaching microdata
A small internal intruder testing exercise was used as confirmation that risk was reduced to an acceptable level based on the number of correct identifications. We have published details of the dataset.
UK 2011 Census tabular outputs
A small number of “intruders” were recruited, representing a range of areas in England and Wales. These included urban, suburban and semi-rural areas. Each intruder was invited to a session and was encouraged to spend around 3.5 hours, an amount of time considered “likely” and “reasonable”. The intruders worked in a meeting room within a secure census area. They were supplied with two laptops: one with unrestricted web access and one containing the census tables. A paper on the intruder testing process is available in the Journal of Privacy and Confidentiality.
Household energy consumption open microdata (released by the former Department for Energy and Climate Change (DECC))
An open dataset and a dataset published under an End User Licence were produced. Intruder testing was carried out by postgraduate students at Southampton University, with a cash prize offered for a correct identification. There were no correct identifications, but as the “year of the energy performance certificate” was considered to be of particular use by the “intruders”, this variable was removed from the published open dataset. Details of the dataset and associated documentation can be seen in a DECC PDF.
6. Appendix
A checklist of procedures
Intruder testing is carried out for a clear research purpose: to assess disclosure risk effectively from an empirical perspective and, as an end result, to ensure that data confidentiality is not compromised.
When should intruder testing be used?
Following the application of disclosure control, intruder testing:
- gives empirical evidence of whether the data can be published or whether further protection is necessary
- enables data users to understand the overall sensitivity of a dataset
- highlights vulnerabilities in the data, specifically variables of concern that an intruder may target to make a disclosure, so that their removal or recategorisation can be considered
- emphasises what pieces of information might be combined with the data to achieve re-identification (known as the mosaic or jigsaw effect)
What are the ethical considerations?
Ethical criteria around intruder testing must be fulfilled, such as transparency about what the work involves and putting appropriate security precautions in place. The Information Asset Owner (IAO) or similar should be consulted as the person with final responsibility for the data, and the decision should be recorded in writing. If required, an ethics review committee could be consulted, which would traditionally include a legal representative and an experienced survey researcher who has an understanding of research ethics relevant to the data type. Some universities have their own ethics approval process if using academic intruders. We have a data ethics advisory committee.
A confidentiality declaration (not necessarily required to have any legal standing) should be put together so that intruders are very clear on what they are being asked to do and what they should not be doing. This could include:
- ensuring that, as far as possible, intruders will act in a trustworthy way; this could involve them obtaining security clearance
- using intruders who will suffer penalties should they misuse the data, for example, employees of the organisation controlling the data
- separating the process between those who analyse and validate claims and those who are actually “intruding”
- telling intruders explicitly not to search for individuals known to the “checker”, for example, employees they work with
- discussing and considering geography so that the intruder has limited chance of inadvertently de-identifying data concerning individuals known to them
- ensuring that the intruder is not told which of their claims are correct, so they do not learn anything new about the identities of respondents in the data
What general security precautions ought to be in place?
These would be location and dataset-specific but could include:
- communicating clearly to all involved that security is being taken very seriously; ensure the intruder has the relevant security clearance
- requiring the intruder to sign a confidentiality declaration stating they will not misuse the data or abuse web access
- carrying out the testing within a secure area, which requires pass access
- supervising the intruder
- locking the laptop containing sensitive data to the desk
- not providing internet access on the same laptop as the data, to prevent intruders from sending out the data inadvertently (or intentionally)
- if the data are particularly sensitive, following the rules for secure settings, for example, prohibiting mobile phones or other devices that could be used to photograph the data
In most cases, the selection of intruders will include consideration of trustworthiness and we would not expect malevolent intent. Many of the measures and protections above are in place to address the potential for inadvertent risk, and also to manage the perceptions that might surround staff members accessing sensitive data.
What incentives could be offered?
It may be appropriate to use incentives to motivate intruders. This could depend on factors such as licensing arrangements, the sensitivity of the data and the profile of the release. A small amount of cash or an equivalent voucher would probably be appropriate. The assumption is that the larger the incentive, the greater the motivation of the intruder. For a supposedly watertight public dataset, where there is near certainty that disclosure is not possible, a larger sum could be offered as an incentive.
Who would be suitable intruders?
The ideal intruder will differ for different datasets. No single potential intruder is likely to meet all of the following criteria, but they should satisfy some of them, depending on the nature of the dataset. Ideal intruders are:
- “friendly” intruders who will behave as far as possible in a similar way to a real intruder without malicious intent
- people with sufficient knowledge of the data so as to understand variable and category meanings
- people from minority groups relevant to potential disclosure risks; for example, given that disclosure may originate from ones or small counts in census data, consider using people from diverse age groups or from minority ethnic groups (taking steps to limit bias as far as possible by choosing a wide, representative sample of intruders who might find different people in the data)
- those who already have security clearance
- people with sufficient IT skills to be able to penetrate the data, though we would not expect them to be professional hackers
- people from a range of geographical areas, both urban and rural
- people from different socio-economic backgrounds or social groups
- people of different ages, for example, younger age groups tend to have greater presence on social media
- experts from academia
- ethical hackers or people with a track record of being successful in intruding exercises
In general when thinking about which intruders to use, consider to whom the data will be made available, for example, data to be disseminated under licence might benefit from more expert intruders such as academics.
What data to use for intruder testing?
The data could either be:
- a safe or already disclosure-controlled dataset to test to ensure residual disclosure is at an appropriately low level
- a sensitive dataset that has not yet been de-identified, where you want to ascertain which variables to include in a public or licensed version
Other data issues to consider
The statistical properties of the data
As a minimum, intruders should be told:
- whether the data are a sample and, if so, how large
- what population the data represent
- what protection has been applied, if any
- a full explanation of the variables and access to detailed definitions or categorisations
Response knowledge
It is sometimes sensible to provide the intruders with additional information from the dataset that an actual intruder might plausibly have knowledge of, to simulate real conditions of prior knowledge or what might be in the public domain.
Examples are:
- provide names and addresses of a sample of respondents and ask intruders to try and match these back to the data; this is on the assumption that a real intruder might know a friend who has participated in the survey (for example)
- provide basic details on a business, such as turnover (“high”, “medium”, “low”) and number of employees (“large”, “medium”, “small”), based on the assumption of a competitor who knows basic details about another business; the intruder then attempts to match this information back to the data (a matching sketch follows these examples)
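To make the matching exercise concrete, the sketch below simulates an intruder who is given a handful of attributes believed to describe one respondent and searches the released file for candidate records. The file, variable names and known attributes are hypothetical.

```python
# A minimal sketch of response-knowledge matching; the file, variable names
# and known attributes are hypothetical.
import pandas as pd

microdata = pd.read_csv("microdata.csv")

# Attributes the intruder is assumed to know about one respondent.
known = {"age_band": 40, "sex": "F", "region": "Wales"}

mask = pd.Series(True, index=microdata.index)
for var, value in known.items():
    mask &= microdata[var] == value

candidates = microdata[mask]
print(f"{len(candidates)} candidate record(s) match the known attributes")
# A single candidate would support a high-confidence identification claim;
# many candidates suggest the known attributes alone are not disclosive.
```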
Supplementary information
This requires an understanding of the law, policy and access requirements governing the data so that the criteria for what disclosure means in practice are clear.
It’s important to distinguish between personal and non-personal data, and between private and personal information. The Statistics and Registration Service Act 2007 (SRSA), which is applicable to our outputs, discusses “personal information”, for which disclosure assessments should take account only of public information. The intruder’s private knowledge does not need to be considered when assessing disclosure risk for datasets released under licence.
The Data Protection Act 1998, which is applicable to outputs from other government departments, refers to “personal data”, for which disclosure may be possible using public or private knowledge (or both).
It’s important to establish the correct criteria for the intruder test. There may be specific laws relating to different data. Consider also how the data will be accessed and therefore what information may be used in that environment.
Simulating the mosaic or jigsaw effect
How can the combination of multiple data sources to facilitate disclosure be represented in an intruder test? Much public information is readily available on the internet. Possible examples (not a complete list) to consider when testing social survey or census data are:
- social media: Facebook, LinkedIn, Twitter
- people finder or CV search
- estate agents
- genealogy websites
- business finder
- search tools such as meta search engines
- iannounce.co.uk to search for family announcements
- the Land Registry
- Google Earth
Test conditions
Time spent intruding should be considered as “reasonable effort”. This could be several hours or longer, depending on the data; it also depends in part on the intruder and whether they are making any progress.
For a really thorough analysis, it is a good idea for the intruder to build up profiles of respondents beforehand, for example by searching local papers. Possible scenarios might be:
- searching for people known to the intruder to confirm whether they are present in the data
- analysing the data for unique or rare individuals or units and then matching to real-world respondents
- looking for people in the public eye or those that might have unusual characteristics
- matching with another large, comparable dataset
What information to collect
In general, the intruder should write down as much detail as possible, all of which should be kept secure and destroyed securely once its purpose is exhausted. This includes ad hoc notes written during the test. In particular they should make a note of:
- variables used to make claims
- what new information was inferred that they think they have learned from the data
- all claims even if the intruder considers them unlikely to be correct
- names and addresses or anything to help validate the identity of a person or unit
- confidence levels
It is useful to give intruders a guide to help establish how confident they are in any claim. This is required when analysing the responses, as claims made with high confidence are taken more seriously than those made with low confidence. A possible approach is shown in Table 1.
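One way to keep the collected information consistent across intruders is to record each claim in a fixed structure. The sketch below is an illustrative Python structure, not an ONS template; Table 1 (the confidence guide) is not reproduced here, so the confidence field simply stores the self-reported percentage.

```python
# An illustrative structure for logging claims; field names and example values
# are assumptions, not an ONS template.
from dataclasses import dataclass
from typing import List

@dataclass
class Claim:
    intruder_id: str            # anonymised identifier for the intruder
    record_id: str              # row or unit the claim refers to
    claimed_identity: str       # name, address or other validating detail
    variables_used: List[str]   # variables used to make the claim
    new_information: str        # what the intruder thinks they have learned
    confidence_pct: int         # self-reported confidence, 0 to 100

claims: List[Claim] = []
claims.append(Claim("intruder_01", "row_1042", "J. Smith, Cardiff",
                    ["age_band", "sex", "occupation"],
                    "approximate income band", confidence_pct=60))
```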
Publishing intruder testing results
A short summary document tabulating the claims of each (anonymised) intruder can be made available to individuals involved in the intruder testing programme. This will allow the results to be discussed in sufficient detail by relevant interested parties. Access to the more detailed notes made by intruders should be kept to a minimum. Identities of intruders should be kept as confidential as realistically possible. Any discussion of results should avoid the publication of outputs that show evidence of self-identification and associated characteristics. Intruder testing summaries should always draw attention to ethical considerations and security measures put in place.
In addition, the intruder should not be told whether their individual claims are correct. These results are to be used to assess the disclosure risk of the dataset in question and serve no additional analytic purpose. Disclosing information on any of the individual claims to anyone would introduce an increased risk of compromising confidentiality protection.
Interpretation of the results
Are the intruder testing results in line with expectations? It is hoped that intruders submit a wide range of claims with varying levels of confidence. This suggests a dataset with sufficient detail to be of use to researchers. The next step is to investigate the validity of these claims. There ought to be a balance between successful and unsuccessful claims: too many correct claims suggest the data are disclosive, while too few claims suggest the data are overprotected. The aim is not to release a dataset with zero risk, so a good result would be a small number of correct claims made with low confidence. It can be helpful to show that some high-confidence claims were incorrect.
Intruder testing results should be used as a basis for reducing the risk in the dataset. The empirical evidence should complement the theoretical risk results.
Care needs to be taken to ensure that the data are not overprotected with the aim being for the Information Asset Owner (IAO) to find a suitable balance between risk and utility. The ultimate aim is to produce a dataset with relatively low risk but relatively high utility. The acceptable level of risk is related to the sensitivity of the data.
Even where the level of correct and incorrect claims appears to be acceptable, there is scope to consider whether there are specific parts of the dataset that may have increased risk. One should look at the types of claims, whether correct or incorrect, and more specifically the logic and sources that intruders used to make claims. For example, if age, sex and marital status are used as the basis for most of the claims, then one might look at the detail in those variables to ensure that the disclosure risk is sufficiently low.
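Once the checking team has validated the claims, a short summary along these lines can support the interpretation. The sketch below assumes a hypothetical CSV of validated claims with "confidence_pct", "correct" and semicolon-separated "variables_used" columns; it cross-tabulates confidence against correctness and counts the variables driving the claims.

```python
# A minimal sketch of summarising validated claims; the input file and its
# columns ("confidence_pct", "correct", "variables_used") are hypothetical.
import pandas as pd

claims = pd.read_csv("validated_claims.csv")

# Band the self-reported confidence and cross-tabulate against correctness.
claims["confidence_band"] = pd.cut(claims["confidence_pct"],
                                   bins=[0, 50, 80, 100],
                                   labels=["low", "medium", "high"])
print(pd.crosstab(claims["confidence_band"], claims["correct"]))

# Which variables are driving the claims? ("variables_used" stored as "a;b;c")
variable_counts = (claims["variables_used"].str.split(";")
                   .explode()
                   .value_counts())
print(variable_counts.head(10))
```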
Intruder testing limitations
How can we ensure that the “intruders” reflect the interests and capabilities of genuine intruders? An exact answer to this question is impossible, so when assessing the results we need to bear in mind that they might not reflect reality.
There is no guarantee that actual intruders will be as observant of the legal and ethical issues as our chosen intruders. It is difficult, if not impossible, to mimic all the ways in which a malicious intruder may try to discover a disclosive element in the data.
Time constraints as applied as part of the intruder testing process are an artificial condition. In reality, a determined intruder will have as much time as they consider necessary. All we can do as part of the exercise is to give our selected intruders a sufficient amount of time to attempt to identify a record or an associated attribute in the data. If an intruder would require more time than this, it may suggest that the balance between risk and utility is satisfactory.
There may be general quality limitations. The selected intruders may not be sufficiently experienced to discover disclosure problems in the data. The additional information provided as response knowledge to the intruders might not be relevant.
Overall these limitations need to be placed in context and can be overcome with sufficient planning and a good knowledge of the data. Whilst intruder testing provides useful insights, the results need to be interpreted with caution. Empirical results should always be backed up with a theoretical risk assessment.
7. Summary
As previously stated, intruder testing is a single step in the development and publication of microdata. As a reminder, here are the major steps required as part of the process.
1. From the tabulation of the selected variable combinations, look for low counts or other distinctive patterns.
2. If there are unique or rare combinations in the data, apply disclosure control to the problematic variable(s); these could include the removal of variables or records, recoding or record swapping.
3. Tabulate the variable combinations again from the “protected” microdata; if there are no obvious disclosure issues go to step 4, otherwise repeat step 2 and apply more disclosure control.
4. Carry out intruder testing (internal, external or both); if there are many successful claims then steps 2 to 4 will need to be repeated.
5. Publish the data under the required licence.
8. Glossary
Definitions of some of the words and phrases used in this article.
Identification key: A combination of variables used by an intruder when attempting to identify a record in the data. These are visible variables (such as age (or age group), sex, occupation, ethnic group), which could be found out by the intruder from direct observation.
Information Asset Owner: Senior individuals involved in running the relevant business. Their role is to understand what information is held, what is added and what is removed, how information is moved and who has access and why.
Intruder: An individual with access to the data who attempts to identify a respondent from the microdata and find additional information about them. A real intruder’s aim is likely to be malicious, for example, to discredit the data provider in particular or the government in general.
Key variable: A single variable, one of the identification key variables.
Mosaic or jigsaw effect: This is the result of considering multiple data sources together. Different data sources, each of which is non-disclosive when considered in isolation, may lead to disclosure concerns when looked at together.
Response knowledge: Knowledge by the intruder that an individual or household is present in the dataset. This may be knowledge of oneself or a relation or work colleague.