12 Data Anonymization
Learning Outcomes
By the end of this chapter, learners will:
- Understand what data is potentially sensitive
- Be able to use different approaches to anonymize data
Introduction
Although specific data about students and their use of OER can provide useful information for decision makers, it also presents a significant risk to the privacy of students. For such data to be shared, it should be anonymized to minimize these risks. If this data is not necessary for the project, it should not be collected in the first place; if it cannot be anonymized, it should not be widely shared except with the person’s consent.
Sensitive Data
Personally identifiable information is sensitive because it allows a person to be identified and matched to other data about them. This includes directly identifying data, such as a student’s name or email address, but also indirect data that can make someone identifiable when enough details are combined within a small enough population. For example, a student’s program, year, gender, and country of origin could together be unique enough to identify that student, even though none of these details is personally identifiable on its own. This chapter on sensitive data has a detailed list of directly and indirectly identifiable data.
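The effect of combining indirect identifiers can be checked by counting how many records share each combination of values. The following is a minimal sketch in Python; the records, field names, and the `smallest_group` helper are invented for illustration and do not come from any real dataset.

```python
from collections import Counter

# Hypothetical survey records; every value here is invented for illustration.
students = [
    {"program": "Engineering", "year": 1, "gender": "F", "country": "Canada"},
    {"program": "Engineering", "year": 1, "gender": "M", "country": "Canada"},
    {"program": "Engineering", "year": 2, "gender": "F", "country": "Brazil"},
    {"program": "Nursing", "year": 1, "gender": "F", "country": "Brazil"},
    {"program": "Nursing", "year": 2, "gender": "F", "country": "Canada"},
    {"program": "Nursing", "year": 2, "gender": "M", "country": "Canada"},
]

FIELDS = ["program", "year", "gender", "country"]


def smallest_group(records, fields):
    """Return the size of the smallest group of records sharing the same values for `fields`."""
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return min(counts.values())


# Each field on its own: every value is shared by at least two students.
single = {field: smallest_group(students, [field]) for field in FIELDS}

# All four fields together: the smallest group shrinks to a single student.
combined = smallest_group(students, FIELDS)

print(single)
print(combined)
```

In this invented example no single field isolates a student, but the combination of all four does, which is the situation described above.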
Keep in mind that even if you don’t request sensitive data, it may appear in free-text responses.
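As a rough illustration, free-text responses can be scanned for obvious identifiers before they are shared. The sketch below redacts email addresses with a regular expression; the pattern, the replacement label, and the sample response are assumptions made for this example. A pattern like this will not catch names or other contextual details, so it supplements a manual review rather than replacing one.

```python
import re

# Deliberately simple email pattern (an assumption for this sketch);
# it will not match every valid address format.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[email removed]", text)


# Invented free-text response for illustration.
response = "The textbook was great. Contact me at tjones@example.edu for details."
cleaned = redact_emails(response)
print(cleaned)
```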
Anonymizing Sensitive Data
Anonymizing removes, changes, or limits the sharing of any information that would allow an individual’s data to be specifically tied to their identity.
Common techniques for anonymization include:
- Removal or non-reporting of identifying data. For example, if you are quoting a response by Tracey Jones in a report, you could replace her name with either a pseudonym or a descriptor (e.g., “engineering undergraduate student”).
- Aggregating data to a level at which it is not identifiable. For example, if a particular program has only five students, it would be relatively easy to identify them indirectly. Instead, you could report on all the programs of a department in aggregate. This reduces the granularity of your analysis but protects the privacy of the students.
- Reporting data separately rather than by intersection. For example, report on the opinions of undergraduates, female students, and international students, rather than the opinions of undergraduate female international students.
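Two of the techniques above, pseudonyms and aggregation, can be sketched in a few lines of Python. Everything in this example is hypothetical: the names, the program-to-department mapping, and the minimum group size of 10 are assumptions chosen for illustration, not recommended values.

```python
from collections import Counter

# Assumed threshold below which a group is considered too small to report.
MIN_GROUP_SIZE = 10

# Hypothetical mapping from programs to their parent department.
PROGRAM_TO_DEPARTMENT = {
    "Mechanical Engineering": "Engineering",
    "Mining Engineering": "Engineering",
}


def pseudonymize(names):
    """Map each distinct name to a neutral label such as 'Participant 1'."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = f"Participant {len(mapping) + 1}"
    return mapping


def aggregate_counts(programs, min_size=MIN_GROUP_SIZE):
    """Count respondents by program, or by department if any program is too small."""
    by_program = Counter(programs)
    if all(count >= min_size for count in by_program.values()):
        return dict(by_program)
    return dict(Counter(PROGRAM_TO_DEPARTMENT[p] for p in programs))


# Invented respondents: one person answered twice, so they keep the same label.
names = ["Tracey Jones", "Sam Lee", "Tracey Jones"]
pseudonyms = pseudonymize(names)

# A small program (5 students) is folded into its department total.
programs = ["Mechanical Engineering"] * 40 + ["Mining Engineering"] * 5
report = aggregate_counts(programs)

print(pseudonyms)
print(report)
```

The mapping from real names to pseudonyms is itself sensitive and would need to be stored separately from the shared data, or discarded. The department total may also still be too small in some settings, so the result is worth checking against the same threshold.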
This guide from the Future of Privacy Forum [PDF] outlines a spectrum of approaches to reducing the identifiability of collected data.
Even if you believe that you have anonymized your dataset, it may still be possible to de-anonymize it. Mathematical approaches have been developed to assess the risk of re-identification. This webinar on the Mathematics of Risk presents ways in which anonymization can be assessed.
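One simple way to put a number on this risk is to treat each record's chance of being singled out as one divided by the number of records that share its indirect identifiers. The sketch below is an illustration of that idea only, and is not necessarily the method presented in the webinar; the function name, field names, and records are assumptions.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Summarize risk from group sizes over the chosen indirect identifiers."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    sizes = Counter(keys)
    return {
        # Smallest group size: every record is shared by at least this many people.
        "k": min(sizes.values()),
        # Worst-case risk for any single record, taken as 1 / group size.
        "max_risk": max(1 / sizes[key] for key in keys),
        # Number of records that are the only one in their group.
        "unique_records": sum(1 for key in keys if sizes[key] == 1),
    }


# Invented records: three students share one combination, one student is alone.
records = [
    {"program": "Engineering", "year": 1},
    {"program": "Engineering", "year": 1},
    {"program": "Engineering", "year": 1},
    {"program": "Nursing", "year": 2},
]
summary = reidentification_risk(records, ["program", "year"])
print(summary)
```

A measure like this only considers the fields you list, so it can understate the risk if someone holds other information about the people in the dataset.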