Why avoid real names?
Using real names (or other personal identities) in research or related activities such as publishing on the web or in journals, presentations, and/or demonstrations can be problematic due to privacy concerns, existing data protection laws and regulations, and ethical obligations. Examples of highly sensitive data include student academic records, health records, and financial records.
To ensure the safety and privacy of all human subjects, many universities and other institutions establish oversight committees known as Institutional Review Boards (IRBs). The board is tasked with minimizing risk to participants. When using data collected from institutional participants, users are required to sign a proposal describing how the data will be used. Among other things, the proposal must include a declaration that anonymous or pseudonymous names will be used in place of real names.
The ‘randomNames’ R-Package
One of my favorite R packages is ‘randomNames’, which is simple and easy to use. It has a single function, randomNames(), which generates random first and last names.
Here are a few simple usage examples.
Generate 6 random names as “Last, First”
library(randomNames)
randomNames(6)
## [1] "Smith, Heather" "Cao, Pauline Cathrina" "Del Rosario, Emily"
## [4] "Le, Rachael" "Mahooty, Deja" "el-Muhammad, Shaakir"
Generate 6 random female names as “Last, First”
randomNames(6, gender = 1)
## [1] "al-Malak, Umaira" "Nguyen, Annabelle" "Ives, Hannah"
## [4] "al-Fahs, Anbara" "Green, Genesis" "Abraham, Jenna"
Generate 6 random names, the first three male and the last three female (gender 0 = male, 1 = female), as “Last, First”
randomNames(6, gender = c(0,0,0,1,1,1))
## [1] "Trevino, Austin" "Wanninger, Drake" "Cherrington, Larry"
## [4] "Lambouths, Shadenia" "Jackson, Joy" "Hoots, Emily"
Generate 6 random names, the first three African American and the last three White (not Hispanic), as “Last, First”
randomNames(6, ethnicity = c(3,3,3,5,5,5))
## [1] "Thurman, Kevin" "Buice, Alexus" "Crowder, Emmanuelle"
## [4] "Yacovetta, Bailey" "Maloy, Kyle" "Bourbeau, Tyler"
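The calls above all return names as “Last, First”. The following is a minimal sketch of a few other arguments, assuming the installed version of randomNames accepts which.names, name.order, and name.sep as described in its documentation. The seed value is arbitrary; set.seed() is used only so that repeated runs should draw the same names.

```r
library(randomNames)

set.seed(2024)  # arbitrary seed so repeated runs should draw the same names

# First names only
first_only <- randomNames(3, which.names = "first")

# Full names written as "First Last" instead of "Last, First"
first_last <- randomNames(3, name.order = "first.last", name.sep = " ")

print(first_only)
print(first_last)
```

Fixing the seed is convenient when a demonstration needs to show the same pseudonyms every time it is rebuilt.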
A larger example
Below is an example based on a file made publicly available by X-University. It is a schedule of classes for one semester, generated by software from a company called Ellucian. The data has variables such as enrollment number (ENRLD), instructor name (INSTRUCTOR), and credit hours (HRS). Many higher education institutions use the same software to generate class schedules and therefore produce similar outputs. Some examples of similar outputs: Savannah State University Spring 2024 Class Schedule, Fort Valley State University Class Schedule, Benedict College Class Schedule Fall 2022. Some of them allow users to download the schedule as an Excel or CSV file. The data can also be copied and pasted.
For the purpose of this demonstration, we use the Spring 2024 schedule of classes from X-University and replace the instructor names in it with random names.
library(dplyr)
schedule<-read.csv("X-College_Spring_2024_Classes_Schedule.csv")
schedule<-subset(schedule,INSTRUCTOR !="") #skip classes with no instructor
schedule<-filter(schedule, P.of.T==1) #Session 1 Classes only
head(schedule)
##   P.of.T SUBJ NUMB                       TITLE HRS ENRLD MAXENRL       TIMES
## 1      1 ACCT 2101  PRINCIPLES OF ACCOUNTING I   3    28      30
## 2      1 ACCT 2101  PRINCIPLES OF ACCOUNTING I   3    30      30 11:00-11:50
## 3      1 ACCT 2102 PRINCIPLES OF ACCOUNTING II   3    16      30
## 4      1 ACCT 2102 PRINCIPLES OF ACCOUNTING II   3    15      30 12:30-01:45
## 5      1 ACCT 3103         INTERM ACCOUNTING I   3     7      30 03:30-04:45
## 6      1 ACCT 4123             COST ACCOUNTING   3     9      30 05:15-06:30
##    DAYS        INSTRUCTOR
## 1          Haile, Brandie
## 2 M W F  Lopez, Elizabeth
## 3          Haile, Brandie
## 4   T R  Lopez, Elizabeth
## 5   T R    Haile, Brandie
## 6   M W el-Hassan, Nawaar
Instructor names appear under the variable ‘INSTRUCTOR’, and we want to replace them with random names while keeping everything else in the data unchanged.
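original_names <- unique(schedule$INSTRUCTOR)
random_names <- randomNames(length(original_names)) # random names to replace the original names
original_column <- schedule$INSTRUCTOR # untouched copy to compare against
for (i in seq_along(original_names)) {
  # Replace only within INSTRUCTOR, matching on the untouched copy so that a
  # freshly assigned random name is never mistaken for an original one
  schedule$INSTRUCTOR[original_column == original_names[i]] <- random_names[i]
}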
head(schedule)
##   P.of.T SUBJ NUMB                       TITLE HRS ENRLD MAXENRL       TIMES
## 1      1 ACCT 2101  PRINCIPLES OF ACCOUNTING I   3    28      30
## 2      1 ACCT 2101  PRINCIPLES OF ACCOUNTING I   3    30      30 11:00-11:50
## 3      1 ACCT 2102 PRINCIPLES OF ACCOUNTING II   3    16      30
## 4      1 ACCT 2102 PRINCIPLES OF ACCOUNTING II   3    15      30 12:30-01:45
## 5      1 ACCT 3103         INTERM ACCOUNTING I   3     7      30 03:30-04:45
## 6      1 ACCT 4123             COST ACCOUNTING   3     9      30 05:15-06:30
##    DAYS        INSTRUCTOR
## 1            Doan, Reyana
## 2 M W F        Hai, Kanoa
## 3            Doan, Reyana
## 4   T R        Hai, Kanoa
## 5   T R      Doan, Reyana
## 6   M W Daniels, Courtney
There we have it, all instructor names under “INSTRUCTOR” in the data are replaced by random names!
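The name-for-name replacement above can also be written without a loop. The following is a minimal sketch in base R on a made-up toy data frame: the data, the placeholder names, and the hard-coded pseudonyms are all hypothetical stand-ins, and in practice the pseudonyms would come from randomNames(). It also checks that every original name gets its own distinct pseudonym, since two instructors sharing one random name would be merged into a single person.

```r
# Toy stand-in for the schedule; all values here are made up
toy <- data.frame(
  SUBJ = c("ACCT", "ACCT", "BIOL", "BIOL"),
  INSTRUCTOR = c("Name A", "Name B", "Name A", "Name C"),
  stringsAsFactors = FALSE
)

original_names <- unique(toy$INSTRUCTOR)

# Hypothetical pseudonyms; in practice these would come from randomNames()
pseudonyms <- c("Pseudonym 1", "Pseudonym 2", "Pseudonym 3")

# Each original name should map to its own distinct pseudonym
stopifnot(length(pseudonyms) == length(original_names),
          !anyDuplicated(pseudonyms))

# match() gives, for every row, the position of its name in original_names;
# indexing pseudonyms by that position replaces the whole column in one step
toy$INSTRUCTOR <- pseudonyms[match(toy$INSTRUCTOR, original_names)]

print(toy)
```

Because the lookup is computed from the original column in a single step, a pseudonym that happens to equal another instructor's real name cannot be replaced a second time.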
References
[1] The randomNames package (version 1.6-0.0) was created by Damian Betebenner; its repository is available on GitHub.