Recoding a multiple choice item with a text field

Benjamin Becker, Johanna Busse

2024-10-09

If a multiple choice item is administered, sometimes not all possible answers can be covered by predefined response options. In such cases, often an additional response option (e.g. “other”) is given accompanied by an open text field. An example of such a multiple choice item is asking for the languages a person is able to speak:

 

 

In the resulting data set, such an item will often be stored as multiple separate variables: dichotomous and numeric (‘dummy’) variables for each multiple choice option (with variable labels describing the response option) and an additional character variable (containing the answers in the text field). For data analysis it is usually necessary to integrate the information from the character variable into the dummy variables. Often the following steps are required:

To illustrate the steps we have implemented a small SPSS example data set in this package. The data set can be loaded using the import_spss() function. For further information on importing SPSS data see import_spss: Importing data from ‘SPSS’. Note that the data set is a minimal working example, containing only the required variables for this illustration.

library(eatGADS)
data_path <- system.file("extdata", "multipleChoice.sav", package = "eatGADS")
gads <- import_spss(data_path)

# Show example data set
gads
#> $dat
#>    ID mcvar1 mcother      stringvar
#> 1   1      1     -94         German
#> 2   2    -94       0            Ger
#> 3   3      0       1            Ger
#> 4   4      1     -94               
#> 5   5    -94       0  Eng, Pol, Ita
#> 6   6      0       1 Pol, Ita, Germ
#> 7   7      1     -94       eng, ita
#> 8   8    -94       0      germ, pol
#> 9   9      0       1         polish
#> 10 10      1     -94       eng, ita
#> 11 11    -94       0            -99
#> 12 12      0       1      Star Trek
#> 
#> $labels
#>     varName         varLabel format display_width labeled value          valLabel missings
#> 1        ID             <NA>   F8.0            NA      no    NA              <NA>     <NA>
#> 2    mcvar1 Language: German   F8.2            NA     yes   -94           missing     miss
#> 3    mcvar1 Language: German   F8.2            NA     yes     0                no    valid
#> 4    mcvar1 Language: German   F8.2            NA     yes     1               yes    valid
#> 5   mcother  Language: other   F8.2            NA     yes   -94           missing     miss
#> 6   mcother  Language: other   F8.2            NA     yes     0                no    valid
#> 7   mcother  Language: other   F8.2            NA     yes     1               yes    valid
#> 8 stringvar   Language: text    A14            NA     yes   -99 missing by design    valid
#> 
#> attr(,"class")
#> [1] "GADSdat" "list"

The variable names of the data set above are connected to the multiple choice question as indicated:

 

Preparing the data set

As illustrated, data can be loaded into R in the GADSdat format via the functions import_spss(),import_DF() or import_raw(). Depending on the original format, omitted responses to open text fields might be stored as empty strings instead of NAs. In these cases, the recode2NA() function should be used to recode these values to NA. Per default, matching strings across all variables in the data set are recoded. Specific variables selection can be specified using the recodeVars argument. Note that the function only performs recodings to exact matches of a single, specific string (in our example "").

gads <- recode2NA(gads, value = "")
#> Recodes in variable ID: 0
#> Recodes in variable mcvar1: 0
#> Recodes in variable mcother: 0
#> Recodes in variable stringvar: 1

Creating and editing a lookup table

With createLookup(), you can create a lookup table which allows recoding one or multiple variables.
You can choose which string variables in a GADSdat object you would like to recode by using the recodeVars argument. The resulting lookup table is a long format data.frame with rows being variable x value pairings. In case you want to sort the output to make recoding easier, the argument sort_by can be used. Extra columns can be added to the lookup table by the argument addCols (but can also be added later manually e.g. in Excel). As test takers can insert multiple languages in the text field, you have to add multiple recode columns to the lookup table. The respective column names are irrelevant and just for convenience purpose.

lookup <- createLookup(GADSdat = gads, recodeVars = "stringvar", sort_by = 'value', 
                       addCols = c("language", "language2", "language3"))

lookup
#>     variable          value language language2 language3
#> 1  stringvar           <NA>       NA        NA        NA
#> 2  stringvar            -99       NA        NA        NA
#> 3  stringvar  Eng, Pol, Ita       NA        NA        NA
#> 4  stringvar            Ger       NA        NA        NA
#> 5  stringvar         German       NA        NA        NA
#> 6  stringvar Pol, Ita, Germ       NA        NA        NA
#> 7  stringvar      Star Trek       NA        NA        NA
#> 8  stringvar       eng, ita       NA        NA        NA
#> 9  stringvar      germ, pol       NA        NA        NA
#> 10 stringvar         polish       NA        NA        NA

Now you have to add the desired values for recoding. You should use (a) unique parts of the existing variable labels of the corresponding dummy variables (see the next section for explanation) and (b) consistent new values that can serve as variable labels later. Spelling mistakes within the recoding will result in additional columns in the final data set! If there are less values than columns you can leave the remaining columns NA.

To fill in the columns you could use R directly to modify the columns. Alternatively, we recommend using eatAnalysis::write_xlsx() to create an Excel file in which you can fill in the values.

# write lookup table to Excel
eatAnalysis::write_xlsx(lookup, "lookup_forcedChoice.xlsx")

 

After filling out the Excel sheet the lookup table might look like this:

 

 

The Excel file can be read back into R via readxl::read_xlsx(). If you want to create specific missing codes, you have to insert the desired (numerical!) missing codes into all columns (e.g. -96 in the lookup table below). The corresponding value labels will be assigned in a later step.

# write lookup table to Excel
eatAnalysis::write_xlsx(lookup, "lookup_multipleChoice.xlsx")

### perform recodes in Excel sheet!

# read lookup table back to R
lookup <- readxl::read_xlsx("lookup_multipleChoice.xlsx")
lookup
#>     variable          value language language2 language3
#> 1  stringvar           <NA>     <NA>      <NA>      <NA>
#> 2  stringvar            -99     <NA>      <NA>      <NA>
#> 3  stringvar  Eng, Pol, Ita  English    Polish   Italian
#> 4  stringvar            Ger   German      <NA>      <NA>
#> 5  stringvar         German   German      <NA>      <NA>
#> 6  stringvar Pol, Ita, Germ   Polish   Italian    German
#> 7  stringvar      Star Trek      -96       -96       -96
#> 8  stringvar       eng, ita  English   Italian      <NA>
#> 9  stringvar      germ, pol   German    Polish      <NA>
#> 10 stringvar         polish   Polish      <NA>      <NA>

Apply lookup to GADSdat

You perform the actual data recoding using the applyLookup_expandVar() function. It applies the recodes defined in the lookup table, thereby creating as many character variables as there are additional columns in the lookup table. Variable names are generated automatically.

gads_string <- applyLookup_expandVar(GADSdat = gads, lookup = lookup)
#> Warning in check_lookup(lookup, GADSdat): Not all values have a recode value assigned (missings in
#> value_new).
#> No rows removed from meta data.
#> Adding meta data for the following variables: stringvar_1
#> No rows removed from meta data.
#> Adding meta data for the following variables: stringvar_2
#> No rows removed from meta data.
#> Adding meta data for the following variables: stringvar_3

gads_string$dat
#>    ID mcvar1 mcother      stringvar stringvar_1 stringvar_2 stringvar_3
#> 1   1      1     -94         German      German        <NA>        <NA>
#> 2   2    -94       0            Ger      German        <NA>        <NA>
#> 3   3      0       1            Ger      German        <NA>        <NA>
#> 4   4      1     -94           <NA>        <NA>        <NA>        <NA>
#> 5   5    -94       0  Eng, Pol, Ita     English      Polish     Italian
#> 6   6      0       1 Pol, Ita, Germ      Polish     Italian      German
#> 7   7      1     -94       eng, ita     English     Italian        <NA>
#> 8   8    -94       0      germ, pol      German      Polish        <NA>
#> 9   9      0       1         polish      Polish        <NA>        <NA>
#> 10 10      1     -94       eng, ita     English     Italian        <NA>
#> 11 11    -94       0            -99        <NA>        <NA>        <NA>
#> 12 12      0       1      Star Trek         -96         -96         -96

In some cases you might have recoded some of the values to specific missing codes. These missing codes have to be now specified by hand as value labels that should be treated as missings. The function changeValLabels() is used to give specific value labels and the function changeMissings() attaches missing codes. The loop below performs the appropriate labeling and missing coding for all three new string variables.

for(nam in paste0("stringvar_", 1:3)) {
  gads_string <- changeValLabels(gads_string, varName = nam, 
                                 value = -96, valLabel = "Missing: Not codeable")
  gads_string <- changeMissings(gads_string, varName = nam, 
                                value = -96, missings = "miss")
}

gads_string$labels
#>        varName         varLabel format display_width labeled value              valLabel missings
#> 1           ID             <NA>   F8.0            NA      no    NA                  <NA>     <NA>
#> 2       mcvar1 Language: German   F8.2            NA     yes   -94               missing     miss
#> 3       mcvar1 Language: German   F8.2            NA     yes     0                    no    valid
#> 4       mcvar1 Language: German   F8.2            NA     yes     1                   yes    valid
#> 5      mcother  Language: other   F8.2            NA     yes   -94               missing     miss
#> 6      mcother  Language: other   F8.2            NA     yes     0                    no    valid
#> 7      mcother  Language: other   F8.2            NA     yes     1                   yes    valid
#> 8    stringvar   Language: text    A14            NA     yes   -99     missing by design    valid
#> 9  stringvar_1   Language: text    A14            NA     yes   -99     missing by design    valid
#> 10 stringvar_1   Language: text    A14            NA     yes   -96 Missing: Not codeable     miss
#> 11 stringvar_2   Language: text    A14            NA     yes   -99     missing by design    valid
#> 12 stringvar_2   Language: text    A14            NA     yes   -96 Missing: Not codeable     miss
#> 13 stringvar_3   Language: text    A14            NA     yes   -99     missing by design    valid
#> 14 stringvar_3   Language: text    A14            NA     yes   -96 Missing: Not codeable     miss

Match values to variable labels

When integrating character variables into multiple dummy variables, there has to be a clear correspondence between values in the character variable and dummy variables. eatGADS requires this information as a named character vector with the dummy variable names as values and values of the text variable as names. Such a vector can be automatically generated by the matchValues_varLabels() function. The function takes a character vector (values) as input and matches all values in this vector to the variable labels of the dummy variables (mc_vars). We provide the content of the character variables as input for the values argument as these are all possible new values.

In case that not every already existing variable label is part of the lookup table you can use the label_by_hand argument. This is always the case for the variable representing the other response option but might be necessary for other response options as well. Alternatively, these values could be added to the value_string as well, to enable automatic matching.

value_string <- c(lookup$language, lookup$language2, lookup$language3)
named_char_vec <- matchValues_varLabels(GADSdat = gads_string, 
                                        mc_vars = c("mcvar1", "mcother"), 
                                        values = value_string, 
                                        label_by_hand = c("other"="mcother"))
named_char_vec
#>    German     other 
#>  "mcvar1" "mcother"

Integrate character and numeric variables

By using the expanded GADSdat object and the named character vector you can collapse the information of the strings with the already existing numeric variables. The following coding of the binary numeric variables is required: 1 = true and 0 = false (for recoding see recodeGADS()). The names of the text variables are specified under text_vars.

If there is an entry in the text variables that matches one of the binary numeric variables, this binary numeric variable will be set to 1. The variable which indicates entries in the text variable (mc_var_4text) is recoded accordingly. If for a row all entries in the text variable can be recoded into the binary numeric variables, the invalid_miss_code is inserted into the text variables and mc_var_4text is changed to 0. If there are valid entries beside the binary numeric variables mc_var_4text is set to 1. If there were no valid entries in text_vars to begin with, mc_var_4text is left as is. All empty entries in the text_vars are assigned missing codes (notext_miss_code).

gads_string2 <- collapseMultiMC_Text(GADSdat = gads_string, mc_vars = named_char_vec, 
                                     text_vars = c("stringvar_1", "stringvar_2", "stringvar_3"), 
                                     mc_var_4text = "mcother", var_suffix = "_r", 
                                     label_suffix = "(recoded)",
                                     invalid_miss_code = -98, 
                                     invalid_miss_label = "Missing: By intention",
                                     notext_miss_code = -99, 
                                     notext_miss_label = "Missing: By intention")
#> No rows removed from meta data.
#> Adding meta data for the following variables: mcvar1_r, mcother_r, stringvar_1_r, stringvar_2_r, stringvar_3_r

gads_string2$dat
#>    ID mcvar1 mcother      stringvar stringvar_1 stringvar_2 stringvar_3 mcvar1_r mcother_r
#> 1   1      1     -94         German      German        <NA>        <NA>        1         0
#> 2   2    -94       0            Ger      German        <NA>        <NA>        1         0
#> 3   3      0       1            Ger      German        <NA>        <NA>        1         0
#> 4   4      1     -94           <NA>        <NA>        <NA>        <NA>        1       -94
#> 5   5    -94       0  Eng, Pol, Ita     English      Polish     Italian      -94         1
#> 6   6      0       1 Pol, Ita, Germ      Polish     Italian      German        1         1
#> 7   7      1     -94       eng, ita     English     Italian        <NA>        1         1
#> 8   8    -94       0      germ, pol      German      Polish        <NA>        1         1
#> 9   9      0       1         polish      Polish        <NA>        <NA>        0         1
#> 10 10      1     -94       eng, ita     English     Italian        <NA>        1         1
#> 11 11    -94       0            -99        <NA>        <NA>        <NA>      -94         0
#> 12 12      0       1      Star Trek         -96         -96         -96        0         1
#>    stringvar_1_r stringvar_2_r stringvar_3_r
#> 1            -98           -98           -98
#> 2            -98           -98           -98
#> 3            -98           -98           -98
#> 4            -99           -99           -99
#> 5        English        Polish       Italian
#> 6         Polish       Italian           -99
#> 7        English       Italian           -99
#> 8         Polish           -99           -99
#> 9         Polish           -99           -99
#> 10       English       Italian           -99
#> 11           -99           -99           -99
#> 12           -96           -96           -96

Trim down variables

Sometimes the number of additional entries should be limited (as theoretically there can be infinite additional entries). This means that the number of character variables is ‘trimmed’. remove2NAchar() performs this trimming. Via max_num the maximum number of text variables is defined and all text variables above this number are removed from the data set. If a row in the data set contains valid entries in one of the removed variables, a specific missing code (na_value) is inserted into this row on all remaining text variables.

gads_string3 <- remove2NAchar(GADSdat = gads_string2, 
                              vars = c("stringvar_1_r", "stringvar_2_r", "stringvar_3_r"), 
                              max_num = 2, na_value = -97, 
                              na_label = "missing: excessive answers")
#> Removing the following rows from meta data: stringvar_3_r
#> No rows added to meta data.

gads_string3$dat
#>    ID mcvar1 mcother      stringvar stringvar_1 stringvar_2 stringvar_3 mcvar1_r mcother_r
#> 1   1      1     -94         German      German        <NA>        <NA>        1         0
#> 2   2    -94       0            Ger      German        <NA>        <NA>        1         0
#> 3   3      0       1            Ger      German        <NA>        <NA>        1         0
#> 4   4      1     -94           <NA>        <NA>        <NA>        <NA>        1       -94
#> 5   5    -94       0  Eng, Pol, Ita     English      Polish     Italian      -94         1
#> 6   6      0       1 Pol, Ita, Germ      Polish     Italian      German        1         1
#> 7   7      1     -94       eng, ita     English     Italian        <NA>        1         1
#> 8   8    -94       0      germ, pol      German      Polish        <NA>        1         1
#> 9   9      0       1         polish      Polish        <NA>        <NA>        0         1
#> 10 10      1     -94       eng, ita     English     Italian        <NA>        1         1
#> 11 11    -94       0            -99        <NA>        <NA>        <NA>      -94         0
#> 12 12      0       1      Star Trek         -96         -96         -96        0         1
#>    stringvar_1_r stringvar_2_r
#> 1            -98           -98
#> 2            -98           -98
#> 3            -98           -98
#> 4            -99           -99
#> 5            -97           -97
#> 6         Polish       Italian
#> 7        English       Italian
#> 8         Polish           -99
#> 9         Polish           -99
#> 10       English       Italian
#> 11           -99           -99
#> 12           -96           -96

Multiple character variables to labeled integers

After using collapseMultiMC_Text() (and remove2NAchar()), only new, additional values are left in the character variables. multiChar2fac() transforms these remaining text variables to numeric, labeled variables. All resulting labeled variables share the exact same value labels, which are sorted alphabetically.

gads_numeric <- multiChar2fac(GADSdat = gads_string3, vars = c("stringvar_1_r", "stringvar_2_r"), 
                              var_suffix = "_r", label_suffix = "(recoded)")

gads_numeric$dat
#>    ID mcvar1 mcother      stringvar stringvar_1 stringvar_2 stringvar_3 mcvar1_r mcother_r
#> 1   1      1     -94         German      German        <NA>        <NA>        1         0
#> 2   2    -94       0            Ger      German        <NA>        <NA>        1         0
#> 3   3      0       1            Ger      German        <NA>        <NA>        1         0
#> 4   4      1     -94           <NA>        <NA>        <NA>        <NA>        1       -94
#> 5   5    -94       0  Eng, Pol, Ita     English      Polish     Italian      -94         1
#> 6   6      0       1 Pol, Ita, Germ      Polish     Italian      German        1         1
#> 7   7      1     -94       eng, ita     English     Italian        <NA>        1         1
#> 8   8    -94       0      germ, pol      German      Polish        <NA>        1         1
#> 9   9      0       1         polish      Polish        <NA>        <NA>        0         1
#> 10 10      1     -94       eng, ita     English     Italian        <NA>        1         1
#> 11 11    -94       0            -99        <NA>        <NA>        <NA>      -94         0
#> 12 12      0       1      Star Trek         -96         -96         -96        0         1
#>    stringvar_1_r stringvar_2_r stringvar_1_r_r stringvar_2_r_r
#> 1            -98           -98             -98             -98
#> 2            -98           -98             -98             -98
#> 3            -98           -98             -98             -98
#> 4            -99           -99             -99             -99
#> 5            -97           -97             -97             -97
#> 6         Polish       Italian               3               2
#> 7        English       Italian               1               2
#> 8         Polish           -99               3             -99
#> 9         Polish           -99               3             -99
#> 10       English       Italian               1               2
#> 11           -99           -99             -99             -99
#> 12           -96           -96             -96             -96

gads_final <- gads_numeric
extractMeta(gads_final)[, c("varName", "value", "valLabel", "missings")]
#>            varName value                   valLabel missings
#> 1               ID    NA                       <NA>     <NA>
#> 2           mcvar1   -94                    missing     miss
#> 3           mcvar1     0                         no    valid
#> 4           mcvar1     1                        yes    valid
#> 5          mcother   -94                    missing     miss
#> 6          mcother     0                         no    valid
#> 7          mcother     1                        yes    valid
#> 8        stringvar   -99          missing by design    valid
#> 9      stringvar_1   -99          missing by design    valid
#> 10     stringvar_1   -96      Missing: Not codeable     miss
#> 11     stringvar_2   -99          missing by design    valid
#> 12     stringvar_2   -96      Missing: Not codeable     miss
#> 13     stringvar_3   -99          missing by design    valid
#> 14     stringvar_3   -96      Missing: Not codeable     miss
#> 15        mcvar1_r   -94                    missing     miss
#> 16        mcvar1_r     0                         no    valid
#> 17        mcvar1_r     1                        yes    valid
#> 18       mcother_r   -94                    missing     miss
#> 19       mcother_r     0                         no    valid
#> 20       mcother_r     1                        yes    valid
#> 21   stringvar_1_r   -99      Missing: By intention     miss
#> 22   stringvar_1_r   -98      Missing: By intention     miss
#> 23   stringvar_1_r   -97 missing: excessive answers     miss
#> 24   stringvar_1_r   -96      Missing: Not codeable     miss
#> 25   stringvar_2_r   -99      Missing: By intention     miss
#> 26   stringvar_2_r   -98      Missing: By intention     miss
#> 27   stringvar_2_r   -97 missing: excessive answers     miss
#> 28   stringvar_2_r   -96      Missing: Not codeable     miss
#> 29 stringvar_1_r_r   -99      Missing: By intention     miss
#> 30 stringvar_1_r_r   -98      Missing: By intention     miss
#> 31 stringvar_1_r_r   -97 missing: excessive answers     miss
#> 32 stringvar_1_r_r   -96      Missing: Not codeable     miss
#> 33 stringvar_1_r_r     1                    English    valid
#> 34 stringvar_1_r_r     2                    Italian    valid
#> 35 stringvar_1_r_r     3                     Polish    valid
#> 36 stringvar_2_r_r   -99      Missing: By intention     miss
#> 37 stringvar_2_r_r   -98      Missing: By intention     miss
#> 38 stringvar_2_r_r   -97 missing: excessive answers     miss
#> 39 stringvar_2_r_r   -96      Missing: Not codeable     miss
#> 40 stringvar_2_r_r     1                    English    valid
#> 41 stringvar_2_r_r     2                    Italian    valid
#> 42 stringvar_2_r_r     3                     Polish    valid

Clean data set

In a last step you can remove unnecessary variables from the GADSdat object by using removeVars().

gads_final2 <- removeVars(gads_final, vars = c("stringvar_1", "stringvar_2", "stringvar_3",
                                               "stringvar_1_r", "stringvar_2_r"))
#> Removing the following rows from meta data: stringvar_1, stringvar_2, stringvar_3, stringvar_1_r, stringvar_2_r
#> No rows added to meta data.