Anonymising data using R

I often work with student data for research purposes, and one of the first steps I take before doing any analyses is to remove identifying details.

Our students have a “University User Name” (UUN) which is an S followed by a 7-digit number (e.g. “S1234567”), which can be used to link different datasets together. I need to replace these identifiers with new IDs, so that the resulting datasets have no personal identifiers in them.

I’ve written a simple R script that can read in multiple .csv files, and replace the identifiers in a consistent way. It also produces a lookup table so that I can de-anonymise the data if needed. But the main thing is that it produces new versions of all the .csv files with all personal identifiers removed!

Code on GitHub