This vignette explains the basic functionality of the package litRiddle, a part of the Riddle of Literary Quality project.
The package contains the data of a reader survey about fiction in Dutch, a description of the novels the readers rated, and the results of stylistic measurements of the novels. The package also contains functions to combine, analyze, and visualize these data.
See: https://literaryquality.huygens.knaw.nl/ for further details. Information in Dutch about the package can be found at https://karinavdo.github.io/RaadselLiteratuur/02_07_data_en_R_package.html.
These data are also available as individual CSV files for those who want to work with the data outside of R. See: https://github.com/karinavdo/RiddleData.
If you use litRiddle in your academic publications, please consider citing the following references:
Eder, M., Lensink, S., van Zundert, J.J., and van Dalen-Oskam, K.H. (2022). Replicating The Riddle of Literary Quality: The LitRiddle Package for R. In Digital Humanities 2022 Conference Abstracts, 636–637. Tokyo: The University of Tokyo / DH2022 Local Organizing Committee. https://dh2022.dhii.asia/abstracts/files/VAN_DALEN_OSKAM_Karina_Replicating_The_Riddle_of_Literary_Qu.html
Karina van Dalen-Oskam (2023). The Riddle of Literary Quality: A Computational Approach. Amsterdam University Press.
Install the package from the CRAN repository:
install.packages("litRiddle")
Alternatively, try installing it directly from the GitHub repository:
library(devtools)
install_github("karinavdo/LitRiddleData", build_vignettes = TRUE)
First, one has to activate the package so that its functions become visible to the user:
library(litRiddle)
## litRiddle version: 1.0.0
##
## Thank you for working with the litRiddle package. We would greatly
## appreciate properly citing this software if you find it of use.
## When citing, you can refer to this software as:
## Eder, M., van Zundert, J., Lensink, S., and van Dalen-Oskam, K. (2023).
## litRiddle: Dataset and Tools to Research the Riddle of Literary Quality
## CRAN. <https://CRAN.R-project.org/package=litRiddle>
## If you prefer to cite a publication instead, here is our suggestion:
## Eder, M., Lensink, S., Van Zundert, J.J., and Van Dalen-Oskam, K.H.
## (2022). Replicating The Riddle of Literary Quality: The LitRiddle
## Package for R. In Digital Humanities 2022 Conference Abstracts,
## 636-637. Tokyo: The University of Tokyo,
## <https://dh2022.dhii.asia/abstracts/163>.
##
## To get full BibTeX entry, type: citation("litRiddle").
To activate the dataset, type one of the following lines (or all of them):
data(books)
data(respondents)
data(reviews)
data(motivations)
data(frequencies)
From now on, the dataset, divided into five data tables, is visible to the user. Please note that the functions discussed below do not need the dataset to be activated (they take care of that themselves), so you can skip this step if you plan to analyze the data using the functions from the package.
Time to explore some of the data tables. Typing the name of a table, e.g. books, will list all its data points:
books
This command dumps quite a lot of material on the screen while offering little insight or overview. It is usually better to select one portion of information at a time, typically one variable or one observation. We assume here that the user has some basic knowledge of R, and in particular knows how to access values in vectors and tables (matrices). To get the titles of the books scored in the survey (or, say, the first 10 titles), one might type:
books$title[1:10]
## [1] Haar naam was Sarah Duel
## [3] Het Familieportret De kraai
## [5] Mannen die vrouwen haten Heldere hemel
## [7] Vijftig tinten grijs Gerechtigheid
## [9] De verrekijker De vrouw die met vuur speelde
## 399 Levels: 1 Fifth Avenue 13 uur 1953 1q84 22/11/63 50/50 Moorden ... Zwarte piste
Well, but how do I know that the name of the particular variable I want is title, rather than anything else? There is a function that lists all the variables from the data tables.
The function that lists all the column names from all the datasets is named get.columns() and needs no arguments. This means that you simply type the following code, remembering the parentheses at the end of the function name:
get.columns()
## $books
## [1] "short.title" "author"
## [3] "title" "title.english"
## [5] "genre" "book.id"
## [7] "riddle.code" "riddle.code.english"
## [9] "translated" "gender.author"
## [11] "origin.author" "original.language"
## [13] "inclusion.criterion" "publication.date"
## [15] "first.print" "publisher"
## [17] "word.count" "type.count"
## [19] "sentence.length.mean" "sentence.length.variance"
## [21] "paragraph.count" "sentence.count"
## [23] "paragraph.length.mean" "raw.TTR"
## [25] "sampled.TTR"
##
## $respondents
## [1] "respondent.id" "gender.resp" "age.resp"
## [4] "zipcode" "education" "education.english"
## [7] "books.per.year" "typically.reads" "how.literary"
## [10] "s.4a1" "s.4a2" "s.4a3"
## [13] "s.4a4" "s.4a5" "s.4a6"
## [16] "s.4a7" "s.4a8" "s.12b1"
## [19] "s.12b2" "s.12b3" "s.12b4"
## [22] "s.12b5" "s.12b6" "s.12b7"
## [25] "s.12b8" "remarks.survey" "date.time"
## [28] "week.nr" "day"
##
## $reviews
## [1] "respondent.id" "book.id" "quality.read"
## [4] "literariness.read" "quality.notread" "literariness.notread"
## [7] "book.read"
##
## $motivations
## [1] "motivation.id" "respondent.id" "book.id" "paragraph.id"
## [5] "sentence.id" "token" "lemma" "upos"
Not bad indeed. But how can I know what s.4a2 stands for? The function that gives a short explanation of what the different column names refer to, and what their levels consist of, is called explain(). To work properly, this function needs one argument: the user has to specify which dataset s/he is interested in. The options are as follows:
explain("books")
## The 'books' dataset contains information on several details of the 401
## different books used in the survey.
##
## Here follows a list with the different column names and an explanation of
## the information they contain:
##
## 1. short.title A short name containing the author's name and
## (a part of) the title;
## 2. author Last name and first name of the author;
## 3. title Full title of the book;
## 4. title.english Full title of the book in English;
## 5. genre Genre of the book.
## There are four different genres:
## a) Fiction; b) Romantic; c) Suspense; d) Other;
## 6. book.id Unique number to identify each book;
## 7. riddle.code More complete list of genres of the books.
## Contains 13 categories --- to see which, type
## `levels(books$riddle.code)` in the terminal;
## 8. riddle.code.english Translation of code in column 7 in English;
## 9. translated 'yes' if the book has been translated,
## 'no' if not;
## 10. gender.author The gender of the author:
## Female, Male, Unknown/Multiple;
## 11. origin.author The country of origin of the author.
## Note that short country codes have been used
## instead of the full country names;
## 12. original.language The original language of the book. Note that short
## language codes have been used, instead of the full
## language names;
## 13. inclusion.criterion In what category a book has been placed, either
## a) bestseller; b) boekenweekgeschenk;
## c) library; or d) literair juweeltje;
## 14. publication.date Publication date of the book, YYYY-MM-DD format;
## 15. first.print Year in which the first print appeared;
## 16. publisher Publishers of the books;
## 17. word.count Word count, or total number of words (tokens)
## used in a book;
## 18. type.count Total number of unique words (types) used in book;
## 19. sentence.length.mean Average sentence length in a book (in words);
## 20. sentence.length.variance Standard deviation of the sentence length;
## 21. paragraph.count Total number of paragraphs in a book;
## 22. sentence.count Total number of sentences in a book;
## 23. paragraph.length.mean Average paragraph length in a book (in words);
## 24. raw.TTR Lexical diversity, or type-token ratio, which
## gives an indication of how diverse the word use
## in a book is;
## 25. sampled.TTR Unlike the raw type-token ratio, the sampled
## TTR is significantly more resistant to text
## size, and thus it should be preferred over the
## raw TTR.
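As a quick illustration of the note above about the sampled TTR (a sketch, not part of the package documentation; assumes the books table has been activated with data(books)): the raw type-token ratio tends to vary with book length, whereas the sampled TTR should be far less sensitive to it.

```r
data(books)
# Compare how strongly each TTR variant co-varies with book length.
# A markedly stronger correlation for raw.TTR would illustrate why
# the size-resistant sampled.TTR is preferred.
cor(books$word.count, books$raw.TTR)
cor(books$word.count, books$sampled.TTR)
```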
explain("reviews")
## The 'reviews' dataset contains four different ratings that were given
## to 401 different books.
##
## Here follows a list with the different column names and an explanation of
## what information they contain:
##
## 1. respondent.id Unique number for each respondent of the survey;
## 2. book.id Unique number to identify each book;
## 3. quality.read Rating on the quality of a book that a respondent
## has read. Scale from 1 - 7, with 1 meaning
## 'very bad' and 7 meaning 'very good';
## 4. literariness.read Rating on how literary a respondent found a book
## that s/he has read. Scale from 1 - 7, with 1 meaning
## 'not literary at all' and 7 meaning 'very literary';
## 5. quality.notread Rating on the quality of a book that a respondent
## has not read. Scale from 1 - 7, with 1 meaning
## 'very bad' and 7 meaning 'very good';
## 6. literariness.notread Rating on how literary a respondent found a book that
## s/he has not read. Scale from 1 - 7, with 1 meaning
## 'not literary at all' and 7 meaning 'very literary';
## 7. book.read 1 or 0: 1 indicates that the respondent read
## the book, 0 indicates the respondent did not
## read the book but had an opinion about
## the literary quality of the book.
explain("respondents")
## The 'respondents' dataset contains information on the people that participated
## in the survey.
##
## Here follows a list with the different column names and an explanation of
## what information they contain:
##
## 1. respondent.id Unique number for each respondent of the survey;
## 2. gender.resp Gender of the respondent: Female, Male, NA;
## 3. age.resp Age of the respondent;
## 4. zipcode Zip code of the respondent;
## 5. education Education level, containing 8 levels (see which
## levels by typing 'levels(respondents$education)'
## in the terminal);
## 6. education.english English translation of education levels.
## 7. books.per.year Number of books read per year by each respondent;
## 8. typically.reads Typical genre of books that a respondent reads,
## with three levels a) Fiction; b) Non-fiction;
## c) both;
## 9. how.literary Answer to the question 'How literary a reader do
## you consider yourself to be?', where respondents
## could fill in a number from 1 - 7, with 1 meaning
## 'not literary at all' and 7 meaning 'very literary';
## 10. s.4a1 Answer to the question: 'I like reading novels that
## I can relate to my own life'. Scale from 1 - 5, with
## 1 meaning 'completely disagree', and 5 meaning
## 'completely agree';
## 11. s.4a2 Answer to the question: 'The story of a novel is what
## matters most to me'. Scale from 1 - 5;
## 12. s.4a3 Answer to the question: 'The writing style in a book
## is important to me'. Scale from 1 - 5;
## 13. s.4a4 Answer to the question: 'I like searching for deeper
## layers in a novel'. Scale from 1 - 5;
## 14. s.4a5 Answer to the question: 'I like reading literature'.
## Scale from 1 - 5;
## 15. s.4a6 Answer to the question: 'I read novels to discover new
## worlds and unknown time periods'. Scale from 1 - 5;
## 16. s.4a7 Answer to the question: 'I mostly read novels during my
## vacation'. Scale from 1 - 5;
## 17. s.4a8 Answer to the question: 'I usually read several novels at
## the same time'. Scale from 1 - 5;
## 18. s.12b1 Answer to the question: 'I like novels based on real
## events'. Scale from 1 - 5;
## 19. s.12b2 Answer to the question: 'I like thinking about a novel's
## structure'. Scale from 1 - 5;
## 20. s.12b3 Answer to the question: 'The writing style in a novel
## is of more importance to me than its story'.
## Scale from 1 - 5;
## 21. s.12b4 Answer to the question: 'I like to get carried away by
## a novel'. Scale from 1 - 5;
## 22. s.12b5 Answer to the question: 'I like to pick my books from
## the top 10 list of best sold books'. Scale from 1 - 5;
## 23. s.12b6 Answer to the question: 'I read novels to be challenged
## intellectually'. Scale from 1 - 5;
## 24. s.12b7 Answer to the question: 'I love novels that are easy
## to read'. Scale from 1 - 5;
## 25. s.12b8 Answer to the question: 'In the evening, I prefer
## to read books over watching TV'. Scale from 1 - 5;
## 26. remarks.survey Any additional remarks that respondents filled in
## at the end of the survey;
## 27. date.time Date and time of the moment a respondent filled in
## the survey, format in YYYY-MM-DD HH:MM:SS;
## 28. week.nr Number of week in which the respondent filled in
## the survey;
## 29. day Day of the week in which the respondent filled in
## the survey.
explain("motivations")
## The 'motivations' dataset contains all motivations that people provided
## about why they gave a certain book a specific rating. The motivations have been
## parsed to provide POS tag information
##
## Here follows a list with the different column names and an explanation of
## what information they contain:
##
## 1. motivation.id Unique number for each motivation given;
## 2. respondent.id Unique number for each respondent;
## 3. book.id Unique number of the book the motivation pertains to;
## 4. paragraph.id Number of paragraph within the motivation;
## 5. sentence.id Number of sentence within the paragraph;
## 6. token Token (in sentence order);
## 7. lemma Lemma of token;
## 8. upos POS tag of token;
explain("frequencies")
## This is a dataframe containing numerical values for word frequencies
## of the 5000 most frequent words (in a descending order of frequency)
## of 401 literary novels in Dutch. The table contains relative frequencies,
## meaning that the original word occurrences from a book were divided
## by the total number of words of the book in question.
##
## The row names coincide with the column short.title from the data frame books.
## The column names list the 5000 most frequent words in the corpus.
The package provides a function to combine all information from the survey, the reviews, and the books into one big dataframe. The user can specify whether or not to also load the frequency table (freqTable) with the frequency counts of the word n-grams of the books.
Combine and load all data from the books, respondents, and reviews into a new dataframe (tibble format):
dat = combine.all(load.freq.table = FALSE)
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
Combine and load all data from the books, respondents, and reviews into a new dataframe (tibble format), and additionally load the frequency table of all word 1-grams of the corpus used:
dat = combine.all(load.freq.table = TRUE)
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
The function find.dataset() returns the name of the dataset in which a given column can be found:
find.dataset("book.id")
## [1] "books" "reviews"
find.dataset("age.resp")
## [1] "respondents"
It’s useful to combine this function with the already discussed get.columns().
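A minimal sketch of such a combination (assuming the package is loaded): flatten all column names into one vector, then look up the home table of a few of them in one go.

```r
library(litRiddle)
# Collect every column name from all the data tables...
all.cols <- unlist(get.columns(), use.names = FALSE)
# ...and look up where each of the first few lives.
# Note that a column such as book.id can occur in more than one table.
sapply(all.cols[1:3], find.dataset)
```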
The function make.table() makes a table of frequency counts for one variable and plots a histogram of the results. Not sure which variable you want to plot? Invoke the already discussed function get.columns() once more to see which variables you can choose from:
get.columns()
Now the fun stuff:
make.table(table.of = 'age.resp')
##
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 63 56 67 66 83 104 126 150 160 156 152 153 142 128 145 143 141 128 126 139
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 123 139 135 124 147 148 181 178 209 196 208 231 229 258 283 312 331 343 372 384
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 389 409 419 394 389 389 407 362 382 445 459 309 312 272 222 159 143 130 96 107
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 98
## 70 54 62 42 49 18 25 19 8 10 7 5 8 3 4 1 1 1 1
You can also adjust the x label, y label, title, and colors:
make.table(table.of = 'age.resp', xlab = 'age respondent',
ylab = 'number of people',
title = 'Distribution of respondent age',
barcolor = 'red', barfill = 'white')
##
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 63 56 67 66 83 104 126 150 160 156 152 153 142 128 145 143 141 128 126 139
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 123 139 135 124 147 148 181 178 209 196 208 231 229 258 283 312 331 343 372 384
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 389 409 419 394 389 389 407 362 382 445 459 309 312 272 222 159 143 130 96 107
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 98
## 70 54 62 42 49 18 25 19 8 10 7 5 8 3 4 1 1 1 1
Note that in the above examples we used single quotes to indicate arguments (e.g. xlab = 'age respondent'), whereas at the beginning of the document we used double quotes (explain("books")). We did this for a reason, namely to emphasize that the functions provided by the package litRiddle are fully compliant with generic R syntax, which allows either single or double quotes to indicate strings.
make.table2(table.of = 'age.resp', split = 'gender.resp')
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
##
## 16 17 18 19 20 21 22 23 24 25 26
## female 704 748 791 735 1238 1889 2536 2507 2879 3205 2701
## male 95 59 215 100 437 194 212 227 267 357 535
## NA 12 0 0 0 2 22 33 10 18 7 14
##
## 27 28 29 30 31 32 33 34 35 36 37
## female 3265 2826 2871 3472 2961 3621 3136 3519 3445 2963 3073
## male 405 429 480 517 487 621 362 401 675 380 909
## NA 0 19 21 48 0 12 1 0 17 15 10
##
## 38 39 40 41 42 43 44 45 46 47 48
## female 3618 2519 4296 4020 5024 5253 5855 5859 5557 6392 6630
## male 602 606 568 619 1111 852 1103 1036 786 709 1750
## NA 0 0 42 0 119 44 16 41 75 4 5
##
## 49 50 51 52 53 54 55 56 57 58 59
## female 8439 9399 9957 10284 10012 11748 13228 12400 13059 12023 12659
## male 1055 1455 1669 1920 2074 2066 2014 2522 2045 2459 3206
## NA 101 148 87 34 194 36 56 39 89 1 34
##
## 60 61 62 63 64 65 66 67 68 69 70
## female 11663 12296 11626 9625 11363 12173 10903 8509 8469 6240 5049
## male 3136 3206 2695 2696 2747 3659 4631 2548 2728 3564 2221
## NA 100 144 76 56 42 0 51 0 0 6 147
##
## 71 72 73 74 75 76 77 78 79 80 81
## female 3530 3495 2991 1944 1905 1863 849 912 758 955 342
## male 1021 1031 1649 812 1129 471 731 618 237 343 113
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 82 83 84 85 86 87 88 89 90 91 93
## female 440 190 123 183 294 51 115 53 48 27 0
## male 216 119 32 88 110 34 23 9 32 0 28
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 94 98
## female 5 16
## male 0 0
## NA 0 0
make.table2(table.of = 'literariness.read', split = 'gender.author')
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
## Warning: Removed 309688 rows containing non-finite values (stat_count).
##
## 1 2 3 4 5 6 7
## female 3565 7145 10667 9259 12221 10630 2530
## male 1591 4532 7679 8553 16334 26154 13090
## unknown/multiple 206 491 837 817 977 570 101
Note that the 'split' variable must have fewer than 31 unique values, to avoid uninterpretable output. E.g., consider the following code:
make.table2(table.of = 'age.resp', split = 'zipcode')
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
## The 'split-by' variable has many unique values, which will make the output very
## hard to process. Please providea 'split-by' variable that contains less unique
## values.
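To check in advance whether a candidate split variable is usable, you can count its unique values yourself (a sketch; assumes the respondents table has been activated):

```r
data(respondents)
# zipcode has far more than 30 unique values and will be rejected
# by make.table2(); gender.resp has only a few and works fine.
length(unique(respondents$zipcode))
length(unique(respondents$gender.resp))
```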
You can also adjust the x label, y label, title, and colors:
make.table2(table.of = 'age.resp', split = 'gender.resp',
xlab = 'age respondent', ylab = 'number of people',
barcolor = 'purple', barfill = 'yellow')
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
##
## 16 17 18 19 20 21 22 23 24 25 26
## female 704 748 791 735 1238 1889 2536 2507 2879 3205 2701
## male 95 59 215 100 437 194 212 227 267 357 535
## NA 12 0 0 0 2 22 33 10 18 7 14
##
## 27 28 29 30 31 32 33 34 35 36 37
## female 3265 2826 2871 3472 2961 3621 3136 3519 3445 2963 3073
## male 405 429 480 517 487 621 362 401 675 380 909
## NA 0 19 21 48 0 12 1 0 17 15 10
##
## 38 39 40 41 42 43 44 45 46 47 48
## female 3618 2519 4296 4020 5024 5253 5855 5859 5557 6392 6630
## male 602 606 568 619 1111 852 1103 1036 786 709 1750
## NA 0 0 42 0 119 44 16 41 75 4 5
##
## 49 50 51 52 53 54 55 56 57 58 59
## female 8439 9399 9957 10284 10012 11748 13228 12400 13059 12023 12659
## male 1055 1455 1669 1920 2074 2066 2014 2522 2045 2459 3206
## NA 101 148 87 34 194 36 56 39 89 1 34
##
## 60 61 62 63 64 65 66 67 68 69 70
## female 11663 12296 11626 9625 11363 12173 10903 8509 8469 6240 5049
## male 3136 3206 2695 2696 2747 3659 4631 2548 2728 3564 2221
## NA 100 144 76 56 42 0 51 0 0 6 147
##
## 71 72 73 74 75 76 77 78 79 80 81
## female 3530 3495 2991 1944 1905 1863 849 912 758 955 342
## male 1021 1031 1649 812 1129 471 731 618 237 343 113
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 82 83 84 85 86 87 88 89 90 91 93
## female 440 190 123 183 294 51 115 53 48 27 0
## male 216 119 32 88 110 34 23 9 32 0 28
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 94 98
## female 5 16
## male 0 0
## NA 0 0
make.table2(table.of = 'literariness.read', split = 'gender.author',
xlab = 'Overall literariness scores',
ylab = 'number of people', barcolor = 'black',
barfill = 'darkred')
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
## Warning: Removed 309688 rows containing non-finite values (stat_count).
##
## 1 2 3 4 5 6 7
## female 3565 7145 10667 9259 12221 10630 2530
## male 1591 4532 7679 8553 16334 26154 13090
## unknown/multiple 206 491 837 817 977 570 101
The original survey about Dutch fiction was designed to rank the responses using descriptive terms, e.g. "very bad", "neutral", "a bit good", etc. In order to conduct the analyses, the responses were then converted to numerical scales ranging from 1 to 7 (the questions about literariness and literary quality) or from 1 to 5 (the questions about the reviewer's reading patterns). However, if you want the responses converted back to their original form, invoke the function order.responses(), which transforms the survey responses into ordered factors. Use either "bookratings" or "readingbehavior" to specify which of the survey questions need to be changed into ordered factors. (We assume here that the user knows what ordered factors are, because otherwise this function will not seem very useful.) Levels of quality.read and quality.notread: "very bad", "bad", "a bit bad", "neutral", "a bit good", "good", "very good", "NA". Levels of literariness.read and literariness.notread: "absolutely not literary", "non-literary", "not very literary", "between literary and non-literary", "a bit literary", "literary", "very literary", "NA". Levels of statements 4/12: "completely disagree", "disagree", "neutral", "agree", "completely agree", "NA".
To create a data frame with ordered factor levels of the questions on reading behavior:
dat.reviews = order.responses('readingbehavior')
str(dat.reviews)
## tibble [13,541 × 29] (S3: tbl_df/tbl/data.frame)
## $ respondent.id : num [1:13541] 0 1 2 3 4 5 6 7 8 9 ...
## $ gender.resp : Factor w/ 3 levels "female","male",..: 1 1 1 1 1 2 1 1 2 1 ...
## $ age.resp : num [1:13541] 18 24 78 77 71 58 38 51 66 32 ...
## $ zipcode : num [1:13541] 4834 5625 2272 2151 NA ...
## $ education : Ord.factor w/ 8 levels "none/primary school"<..: 5 7 7 5 5 7 6 6 7 7 ...
## $ education.english: Ord.factor w/ 8 levels "No education / primary school"<..: 5 7 7 5 5 7 6 6 7 7 ...
## $ books.per.year : num [1:13541] 20 30 30 12 15 60 25 30 50 2 ...
## $ typically.reads : Ord.factor w/ 5 levels "completely disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ how.literary : Ord.factor w/ 5 levels "completely disagree"<..: 3 3 3 3 4 2 3 3 1 3 ...
## $ s.4a1 : Ord.factor w/ 5 levels "completely disagree"<..: 4 4 4 3 4 2 3 2 3 2 ...
## $ s.4a2 : Ord.factor w/ 5 levels "completely disagree"<..: 4 4 5 4 3 4 5 3 4 5 ...
## $ s.4a3 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 4 4 4 5 4 5 4 4 ...
## $ s.4a4 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 4 3 4 3 1 4 4 4 ...
## $ s.4a5 : Ord.factor w/ 5 levels "completely disagree"<..: 5 5 4 3 4 4 3 5 5 4 ...
## $ s.4a6 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 4 4 4 4 4 4 3 4 ...
## $ s.4a7 : Ord.factor w/ 5 levels "completely disagree"<..: 4 3 3 2 2 1 3 2 2 5 ...
## $ s.4a8 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 3 4 2 3 1 5 4 1 ...
## $ s.12b1 : Ord.factor w/ 5 levels "completely disagree"<..: 2 4 4 3 4 2 3 2 3 3 ...
## $ s.12b2 : Ord.factor w/ 5 levels "completely disagree"<..: 4 1 4 4 3 4 2 3 5 3 ...
## $ s.12b3 : Ord.factor w/ 5 levels "completely disagree"<..: 3 3 3 3 3 3 2 3 3 3 ...
## $ s.12b4 : Ord.factor w/ 5 levels "completely disagree"<..: 4 3 4 4 4 4 5 4 4 4 ...
## $ s.12b5 : Ord.factor w/ 5 levels "completely disagree"<..: 1 2 3 2 3 1 2 2 4 2 ...
## $ s.12b6 : Ord.factor w/ 5 levels "completely disagree"<..: 4 4 4 3 3 4 2 4 3 2 ...
## $ s.12b7 : Ord.factor w/ 5 levels "completely disagree"<..: 2 3 4 4 4 2 5 3 2 2 ...
## $ s.12b8 : num [1:13541] 3 4 3 4 3 3 2 4 3 3 ...
## $ remarks.survey : chr [1:13541] "" "" "" "" ...
## $ date.time : POSIXct[1:13541], format: "2013-06-04 11:12:00" "2013-04-10 15:33:00" ...
## $ week.nr : num [1:13541] 23 15 15 27 15 29 15 15 15 15 ...
## $ day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 4 5 6 4 2 4 4 4 5 ...
To create a data frame with ordered factor levels of the book ratings:
dat.ratings = order.responses('bookratings')
str(dat.ratings)
## tibble [448,055 × 7] (S3: tbl_df/tbl/data.frame)
## $ respondent.id : num [1:448055] 0 0 0 0 0 0 0 0 0 0 ...
## $ book.id : num [1:448055] 1 9 11 19 30 34 82 116 300 372 ...
## $ quality.read : Ord.factor w/ 7 levels "very bad"<"bad"<..: 6 5 7 5 5 7 5 5 6 6 ...
## $ literariness.read : Ord.factor w/ 7 levels "absolutely not literary"<..: 5 6 6 6 4 6 3 5 6 6 ...
## $ quality.notread : Ord.factor w/ 7 levels "very bad"<"bad"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ literariness.notread: Ord.factor w/ 7 levels "absolutely not literary"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ book.read : num [1:448055] 1 1 1 1 1 1 1 1 1 1 ...
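With the ratings stored as ordered factors, base R functions respect the level order; for instance, tabulating the verbal quality ratings (a sketch, assuming dat.ratings was created as above):

```r
# Counts per verbal quality rating, ordered from 'very bad'
# up to 'very good'; NA ratings are dropped by table() by default.
table(dat.ratings$quality.read)
```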
The data frame frequencies contains numerical values for word frequencies of the 5000 most frequent words (in descending order of frequency) of 401 literary novels in Dutch. The table contains relative frequencies, meaning that the original word occurrences from a book were divided by the total number of words of the book in question. The measurements were obtained using the R package stylo, and were later rounded to the 5th digit. To learn more about the novels themselves, type help(books).
The row names of the frequencies data frame contain the titles of the novels, corresponding to the short.title column in the data frame books.
rownames(frequencies)[10:20]
## [1] "Allende_NegendeSchriftVan" "Amirrezvani_DochterVanIsfahan"
## [3] "Ammaniti_JijEnIk" "Ammaniti_LaatFeestBeginnen"
## [5] "Ammaniti_LaatsteOudejaarVan" "Ammaniti_ZoGodWil"
## [7] "Appel_VanTweeKanten" "Appel_Weerzin"
## [9] "Auel_LiedVanGrotten" "Austin_EindelijkThuis"
## [11] "Avallone_Staal"
Listing the relative frequency values for the novel Weerzin by René Appel:
frequencies["Appel_Weerzin", ][1:10]
## de het en een ik ze dat hij van in
## 2.91937 2.89593 1.80979 2.64645 1.10820 2.11716 1.75603 2.50586 1.40455 1.20331
And getting the book information:
books[books["short.title"] == "Appel_Weerzin", ]
## short.title author title title.english genre book.id
## 391 Appel_Weerzin Appel, René Weerzin *Aversion Suspense 391
## riddle.code riddle.code.english translated gender.author
## 391 305 LITERAIRE THRILLER 305 Literary thriller no male
## origin.author original.language inclusion.criterion publication.date
## 391 NL NL bestseller 2011-10-31
## first.print publisher word.count type.count sentence.length.mean
## 391 2008 Ambo/Anthos B.V. 72134 5710 9.9646
## sentence.length.variance paragraph.count sentence.count
## 391 7.2009 2168 7239
## paragraph.length.mean raw.TTR sampled.TTR
## 391 33.2721 0.0792 0.2466
Version 1.0 of the package introduces a table motivations, containing the 200k+ lemmatized and POS-tagged tokens making up the text of all motivations. The Dutch Language Institute (INT, Leiden) took care of POS-tagging the data. The tagging was manually corrected by Karina van Dalen-Oskam. We tried to guarantee the highest possible quality, but mistakes may still occur.
A separate token-based table was chosen so as not to burden the table reviews with lots of text, XML, or JSON in additional columns, which could lead to problems with default memory constraints in R.
To retrieve all tokens:
data(motivations)
head(motivations, 15)
## motivation.id respondent.id book.id paragraph.id sentence.id token
## 1 1 0 82 1 1 Het
## 2 1 0 82 1 1 is
## 3 1 0 82 1 1 een
## 4 1 0 82 1 1 snel
## 5 1 0 82 1 1 te
## 6 1 0 82 1 1 lezen
## 7 1 0 82 1 1 en
## 8 1 0 82 1 1 snel
## 9 1 0 82 1 1 te
## 10 1 0 82 1 1 doorgronden
## 11 1 0 82 1 1 boek
## 12 1 0 82 1 1 met
## 13 1 0 82 1 1 vlakke
## 14 1 0 82 1 1 personages
## 15 1 0 82 1 1 en
## lemma upos
## 1 het PRON_DET
## 2 is VERB
## 3 een PRON_DET
## 4 snel ADJ
## 5 te ADP
## 6 lezen VERB
## 7 en CONJ
## 8 snel ADJ
## 9 te ADP
## 10 doorgronden VERB
## 11 boek NOUN
## 12 met ADP
## 13 vlak ADJ
## 14 personage NOUN
## 15 en CONJ
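The token-level layout lends itself to quick corpus-style summaries; for instance, the most frequent lemmas across all motivations (a sketch; assumes the motivations table has been activated):

```r
data(motivations)
# The ten most frequent lemmas over all motivation texts.
head(sort(table(motivations$lemma), decreasing = TRUE), 10)
```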
Usually, one will want to work with the full text of the motivations. A convenience function motivations.text() is provided to create a view with one motivation per row:
# We're importing `dplyr` to use `tibble` so we can
# show very large tables somewhat more nicely.
suppressMessages(library(dplyr))
mots = motivations.text()
tibble(mots)
## # A tibble: 11,950 × 4
## motivation.id book.id respondent.id text
## <dbl> <dbl> <dbl> <chr>
## 1 1 82 0 Het is een snel te lezen en snel te door…
## 2 2 46 1 Ik vond HhhH eerder een verslag van een …
## 3 3 153 2 het is goed verteld , heeft ook een hist…
## 4 4 239 3 Een prachtig verhaal en uitstekend gesch…
## 5 5 248 5 Het is een goed en snel verteld verhaal …
## 6 6 382 6 -
## 7 7 91 7 Heel de opbouw van het verhaal , de pers…
## 8 8 311 8 Het is de combinatie van vorm en inhoud …
## 9 9 128 10 Wanneer ik denk dat ik een boek nogmaals…
## 10 10 392 11 Beeldend en goed geschreven .
## # ℹ 11,940 more rows
Note that loading dplyr masks the explain function from the package litRiddle, because dplyr has its own explain function. To use litRiddle's explain function after dplyr has been loaded, call it explicitly, like this: litRiddle::explain("books").
Gathering all motivations for, for instance, one book requires some trivial merging. Let's see what people said about Binet's HhhH. For this we select the book information of the book with ID 46 and left join (merge) it (book.id by book.id) with the table mots holding all the motivations:
mots_hhhh <- merge(x = books[books["book.id"] == 46, ], y = mots, by = "book.id", all.x = TRUE)
tibble(mots_hhhh)
## # A tibble: 64 × 28
## book.id short.title author title title.english genre riddle.code
## <int> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 2 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 3 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 4 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 5 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 6 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 7 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 8 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 9 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 10 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## # ℹ 54 more rows
## # ℹ 21 more variables: riddle.code.english <fct>, translated <fct>,
## # gender.author <fct>, origin.author <fct>, original.language <fct>,
## # inclusion.criterion <fct>, publication.date <date>, first.print <int>,
## # publisher <fct>, word.count <int>, type.count <int>,
## # sentence.length.mean <dbl>, sentence.length.variance <dbl>,
## # paragraph.count <int>, sentence.count <int>, paragraph.length.mean <dbl>, …
Hmm… that is a pretty wide table. Let's select just the text column to get an idea of what is being said, and print with the n parameter to see more rows:
print(tibble(mots_hhhh[,"text"]), n = 40)
## # A tibble: 64 × 1
## `mots_hhhh[, "text"]`
## <chr>
## 1 vanwege de verpakking
## 2 Het vertelperspectief , het feit dat de schrijver en onderzoeker ook op zijn…
## 3 Fictie en non-fictie ( geschiedenis ) gemengd , met gebruikmaking van litera…
## 4 Ja , dat kan ik . Hoewel het hier om non-fictie gaat , zet de auteur alle mo…
## 5 Omdat het zo is .
## 6 Op grond van het spel dat Binet speelt met de verwachtingen van de lezer , d…
## 7 Tussen historische roman en geromantiseerde historie .
## 8 Door de vorm waarin het geschreven is . Vooral de beschouwingen over bronnen…
## 9 ' Vanwege het complexe gedachtengoed dat deze roman rijk is taalgebruik is v…
## 10 de verteller neemt mij mee in de beschreven wereld. ik voel me aangesproken …
## 11 Het is fascinerend . Mooi spel met wat werkelijk gebeurd is en wat de schrij…
## 12 Plot en spanning in combinatie met historische context
## 13 Op een bijzondere wijze schakelt Binet regelmatig de lezer zelf in om zijn v…
## 14 Per ongeluk aangevinkt , niet gelezen , sorry
## 15 Het is een geweldig onderzoek en goed geschreven bovendien . Toch is het gee…
## 16 ' Het ging ' ergens over '.'
## 17 -
## 18 Die kwaliteit ligt in de structuur en beeldende taal van het boek . Eeen goe…
## 19 Je moet in dit boek groeien . Dan laat het je niet meer los . Het blijft lan…
## 20 Het boek heeft meerdere lagen , en benadert vanuit een originele invalshoek …
## 21 Als of zelfs H nog een ziel kan hebben .
## 22 mixture fictie en non-fictie
## 23 Het boek zet mij aan tot nadenken en is ' vloeiend ' geschreven .
## 24 Het schrijfproces van de schrijver en zijn verhaal lopen in verschillende ho…
## 25 veel interieure monologen , bizarre gedachtenspinsels , geheimzinnige zinswe…
## 26 het heeft op mij een goede indruk gemaakt over het verleden
## 27 ' Ik baseer me slechts op een enkel hoofdstuk maar voor wat dat aangaat , is…
## 28 Zeer indringend geschreven , het is een buitengewoon gevoelig onderwerp , ma…
## 29 In het boek wordt op verfijnde wijze gespeeld met de grens tussen fictie en …
## 30 Er wordt niet alleen een verhaal verteld , de verteller is ook duidelijk vra…
## 31 schrijftsijl , opbouw
## 32 Ik vind het bijzonder hoe Binet non-fictie en fictie met elkaar weet te comb…
## 33 het is een nieuw genre : de persoonlijke biografie ( combinatie van biografi…
## 34 Echt andere vorm van verhalen , ernstig geschreven . In het Frans gelezen en…
## 35 Het is een verhalend soort non-fictie , gelezen in vertaling , dat wel , waa…
## 36 Interessant spel met het werk van de historicus : bevragen van de eigen meth…
## 37 Nee , dat kan ik moeilijk toelichten . De rol van de auteur zelf binnen het …
## 38 omdat het mij niet boeide
## 39 Ik vond het een vreselijk boek misschien dat ik daarom denk dat het in hoge …
## 40 Binet behandelt niet alleen de geschiedenis van Heydrich , maar ook de manie…
## # ℹ 24 more rows
If we also want to include review information, this requires another merge. Rather than trying to combine all data in one huge statement, it is usually easier to follow a step-by-step method. First let's collect the motivations for HhhH, this time being more selective about columns. If you compare the following query with the merge statement above, you will find that we use only the author and title from books and only the respondent ID and the motivation text from mots, while we use book.id from both tables to match for merging.
mots_hhhh = merge(x = books[books["book.id"] == 46, c("book.id", "author", "title")], y = mots[, c("book.id", "respondent.id", "text")], by = "book.id", all.x = TRUE)
tibble(mots_hhhh)
## # A tibble: 64 × 5
## book.id author title respondent.id text
## <int> <fct> <fct> <dbl> <chr>
## 1 46 Binet, Laurent HhhH 361 vanwege de verpakking
## 2 46 Binet, Laurent HhhH 4503 Het vertelperspectief , het feit …
## 3 46 Binet, Laurent HhhH 9910 Fictie en non-fictie ( geschieden…
## 4 46 Binet, Laurent HhhH 1923 Ja , dat kan ik . Hoewel het hier…
## 5 46 Binet, Laurent HhhH 505 Omdat het zo is .
## 6 46 Binet, Laurent HhhH 1242 Op grond van het spel dat Binet s…
## 7 46 Binet, Laurent HhhH 4963 Tussen historische roman en gerom…
## 8 46 Binet, Laurent HhhH 1425 Door de vorm waarin het geschreve…
## 9 46 Binet, Laurent HhhH 968 ' Vanwege het complexe gedachteng…
## 10 46 Binet, Laurent HhhH 8986 de verteller neemt mij mee in de …
## # ℹ 54 more rows
We now have a new view that we can again merge with the information
in the reviews
data:
tibble(merge(x = mots_hhhh, y = reviews, by = c("book.id", "respondent.id"), all.x = TRUE))
## # A tibble: 64 × 10
## book.id respondent.id author title text quality.read literariness.read
## <int> <dbl> <fct> <fct> <chr> <dbl> <dbl>
## 1 46 1 Binet, Laur… HhhH Ik v… 7 2
## 2 46 1022 Binet, Laur… HhhH Een … 7 6
## 3 46 10278 Binet, Laur… HhhH veel… 7 7
## 4 46 10350 Binet, Laur… HhhH Het … 7 5
## 5 46 10546 Binet, Laur… HhhH Fant… 6 7
## 6 46 10735 Binet, Laur… HhhH Het … 7 7
## 7 46 10980 Binet, Laur… HhhH Stij… 6 7
## 8 46 1121 Binet, Laur… HhhH ' He… 6 6
## 9 46 11443 Binet, Laur… HhhH omda… 4 NA
## 10 46 11586 Binet, Laur… HhhH Op e… 7 7
## # ℹ 54 more rows
## # ℹ 3 more variables: quality.notread <dbl>, literariness.notread <dbl>,
## # book.read <dbl>
Note how we use a vector for by to ensure we match on both book ID and respondent ID. If we used only book.id, we would get all scores for that book by all respondents, but we want the scores by the particular respondents who motivated their rating.
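To illustrate why the compound key matters, consider two toy data frames (hypothetical values, not from the package). Merging them on a single shared key multiplies the matching rows, whereas the compound key keeps each motivation paired with its own rating:

```r
# Toy data frames (hypothetical values, not from the litRiddle package).
ratings <- data.frame(book.id = c(1, 1), respondent.id = c(10, 20),
                      score = c(7, 4))
texts <- data.frame(book.id = c(1, 1), respondent.id = c(10, 20),
                    text = c("mooi", "saai"))

# Merging on book.id alone crosses every rating with every text: 4 rows,
# two of which pair a motivation with another respondent's score.
nrow(merge(texts, ratings, by = "book.id"))                       # 4

# Merging on the compound key keeps each motivation with its own rating.
nrow(merge(texts, ratings, by = c("book.id", "respondent.id")))   # 2
```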
And – being sceptical, as we always should be about our strategies – let us just check that we did not miss anything, and verify that respondent 1022 indeed had only one rating for HhhH:
reviews[reviews["respondent.id"] == 1022 & reviews["book.id"] == 46, ]
## respondent.id book.id quality.read literariness.read quality.notread
## 33356 1022 46 7 6 NA
## literariness.notread book.read
## 33356 NA 1
Suppose we want to look into the word frequencies of the motivations. We can use base R's table function to get an idea of how often each combination of lemma and POS tag appears in the motivations:
toks = motivations # Remember: this is a *token* table, one token + lemma + POS tag per row.
head(table(toks$lemma, toks$upos), n = 30)
##
## ADJ ADP ADV CONJ INTJ NOUN NUM PRON_DET PROPN PUNCT VERB X
## - 0 0 0 0 0 0 0 0 0 358 0 0
## -- 0 0 0 0 0 0 0 0 0 5 0 0
## --- 0 0 0 0 0 0 0 0 0 3 0 0
## ---- 0 0 0 0 0 0 0 0 0 2 0 0
## -)' 0 0 0 0 0 0 0 0 0 1 0 0
## – 0 0 0 0 0 0 0 0 0 4 0 0
## , 0 0 0 0 0 0 0 0 0 9558 0 0
## ,, 0 0 0 0 0 0 0 0 0 4 0 0
## ,. 0 0 0 0 0 0 0 0 0 4 0 0
## ,... 0 0 0 0 0 0 0 0 0 3 0 0
## ,..., 0 0 0 0 0 0 0 0 0 1 0 0
## ,.... 0 0 0 0 0 0 0 0 0 1 0 0
## ,........ 0 0 0 0 0 0 0 0 0 1 0 0
## ,' 0 0 0 0 0 0 0 0 0 1 0 0
## ,( 0 0 0 0 0 0 0 0 0 1 0 0
## : 0 0 0 0 0 0 0 0 0 312 0 0
## :-) 0 0 0 0 0 0 0 0 0 3 0 0
## :-)' 0 0 0 0 0 0 0 0 0 1 0 0
## :) 0 0 0 0 0 0 0 0 0 4 0 0
## ! 0 0 0 0 0 0 0 0 0 175 0 0
## !, 0 0 0 0 0 0 0 0 0 1 0 0
## !! 0 0 0 0 0 0 0 0 0 12 0 0
## !!! 0 0 0 0 0 0 0 0 0 6 0 0
## !!!! 0 0 0 0 0 0 0 0 0 1 0 0
## !!' 0 0 0 0 0 0 0 0 0 4 0 0
## !' 0 0 0 0 0 0 0 0 0 8 0 0
## !', 0 0 0 0 0 0 0 0 0 1 0 0
## !) 0 0 0 0 0 0 0 0 0 5 0 0
## !)' 0 0 0 0 0 0 0 0 0 1 0 0
## ¡ 0 0 0 0 0 0 0 0 0 1 0 0
Wow, respondents are creative about using punctuation! In the interest of completeness we chose not to clean all those emoticons out of the data set. However, here we do not need them, so we filter and sort. The code in the next cell is not trivial if you are new to R or to regular expressions; hopefully the inserted comments clarify it a bit. Note, just in case you run into puzzling errors: this uses dplyr's filter, as we imported dplyr above. Base R's filter requires a different approach.
# filter out tokens that do not contain at least one word character.
# We use the regular expression "\w+", which means "one or more word
# characters"; the added backslash prevents R from interpreting the
# backslash as an escape character.
mots = filter(motivations, grepl('\\w+', lemma))
# create a data frame out of a table of raw frequencies.
# Look up 'table function' in the R documentation.
mots = data.frame(table(mots$lemma, mots$upos))
# use interpretable column names
colnames(mots) = c("lemma", "upos", "freq")
# select only useful information, i.e. those lemma+POS combinations
# that appear more than 0 times
mots = mots[mots['freq'] > 0, ]
# sort from most used to least used
mots = mots[order(mots$freq, decreasing = TRUE), ]
# finally show as a nicer looking table
tibble(mots)
## # A tibble: 10,749 × 3
## lemma upos freq
## <fct> <fct> <int>
## 1 het PRON_DET 9889
## 2 de PRON_DET 6983
## 3 een PRON_DET 5586
## 4 ik PRON_DET 5210
## 5 en CONJ 5017
## 6 is VERB 4756
## 7 van ADP 3766
## 8 niet ADV 3730
## 9 boek NOUN 3573
## 10 in ADP 2917
## # ℹ 10,739 more rows
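As noted above, the filtering step relies on dplyr's filter. For completeness, a base R sketch of the same step uses logical subsetting instead (this assumes the litRiddle motivations token table is loaded):

```r
# Base R equivalent of filter(motivations, grepl('\\w+', lemma)):
# keep only the rows whose lemma contains at least one word character.
# Assumes 'motivations' from the litRiddle package is available.
mots_base <- motivations[grepl("\\w+", motivations$lemma), ]
```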
And, rather unsurprisingly, it is the pronouns and other functors that lead the pack.
For another exercise, let's look up something about the lemma "boek" (English: "book"):
mots[mots["lemma"] == "boek", ]
## lemma upos freq
## 50328 boek NOUN 3573
## 109362 boek X 3
Linguistic parsers are not infallible. Apparently in three cases the parser did not know how to classify the word "boek", in which case the POS tag handed out is "X". Can we find the contexts where those linguistic unknowns were used? For this we first find the rows in the token table where this happened:
# First we find the motivation IDs from the books where this happens.
boekx = motivations[motivations["lemma"] == "boek" & motivations["upos"] == "X", ]
boekx
## motivation.id respondent.id book.id paragraph.id sentence.id token lemma
## 317 12 14 159 1 1 boek boek
## 144901 8132 9288 107 1 1 boek boek
## 147507 8258 9431 62 1 2 boek boek
## upos
## 317 X
## 144901 X
## 147507 X
Now we need the full texts of all motivations, so we can find those three motivations we are looking for.
mots_text = motivations.text()
To find the three motivations we merge the boekx table with the table holding all the motivation texts, and we keep only those rows that pertain to the three motivation IDs. That is, we merge with by = "motivation.id" and all.x = TRUE, implying that we keep all rows from x (i.e. the three motivations with "boek" POS-tagged as "X") and drop all non-matching rows from y (i.e. all motivations that do not have those linguistically unknown "boek" mentions).
boekx_mots_text = merge(x = boekx, y = mots_text, by = "motivation.id", all.x = TRUE)
And finally we show what those contexts are:
tibble(boekx_mots_text[ , c( "book.id.x", "respondent.id.x", "text")])
## # A tibble: 3 × 3
## book.id.x respondent.id.x text
## <dbl> <dbl> <chr>
## 1 159 14 maakt me niet uit of het literair is als ik het eee…
## 2 107 9288 Het is een feel good boek , er is geen literaire wa…
## 3 62 9431 Is net een graadje beter dan 3 stuivers romans , ma…
And just for good measure the full text of the third mention:
boekx_mots_text[3, "text"]
## [1] "Is net een graadje beter dan 3 stuivers romans , maar het blijft lectuur , verstrooiing , boeken die je zeer waarschijnlijk geen tweede keer leest want dat weet je al wie elkaar krijgen en hoe . Feel good boek ."
Next versions of the litRiddle package will support Likert plots. Visit https://github.com/jbryer/likert to learn more about the general idea and its implementation in R.
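In the meantime, a Likert-style plot of the ratings can already be sketched with the likert package from CRAN. This is only an assumption of one possible approach: likert is not a litRiddle dependency, and the column choices below are merely illustrative.

```r
# A sketch: plot the distributions of the 1-7 quality and literariness
# ratings with the 'likert' package (not a litRiddle dependency).
# Assumes the 'reviews' table from litRiddle is loaded.
library(likert)

items <- data.frame(
  quality      = factor(reviews$quality.read, levels = 1:7),
  literariness = factor(reviews$literariness.read, levels = 1:7))
plot(likert(items))
```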
Next versions of the litRiddle package will also support topic modeling of the motivations provided by the reviewers.
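Until then, an ad-hoc sketch is possible with the CRAN packages tm and topicmodels. These are assumptions on our part: neither is a litRiddle dependency, and the number of topics (k = 5) is an arbitrary illustrative choice.

```r
# A sketch of topic modeling the motivation texts with 'tm' and
# 'topicmodels' (neither is a litRiddle dependency; k = 5 is arbitrary).
library(tm)
library(topicmodels)

# Build a document-term matrix from the full motivation texts.
texts  <- motivations.text()$text
corpus <- VCorpus(VectorSource(texts))
dtm    <- DocumentTermMatrix(corpus,
            control = list(removePunctuation = TRUE,
                           stopwords = stopwords("dutch")))

# LDA cannot handle empty documents, so drop all-zero rows first.
dtm <- dtm[slam::row_sums(dtm) > 0, ]
lda <- LDA(dtm, k = 5, control = list(seed = 42))
terms(lda, 10)  # the ten most probable terms per topic
```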
Each function provided by the package has its own help page; the same applies to the datasets:
help(books)
help(respondents)
help(reviews)
help(frequencies)
help(combine.all)
help(explain)
help(find.dataset)
help(get.columns)
help(make.table)
help(make.table2)
help(order.responses)
help(litRiddle) # for the general description of the package
All the datasets use the UTF-8 (Unicode) encoding. This should normally not cause any problems on macOS and Linux machines, but Windows might be trickier in this respect. We have not experienced any inconveniences in our testing environments, but we cannot guarantee the same for all other machines.
Karina van Dalen-Oskam (2023). The Riddle of Literary Quality: A Computational Approach. Amsterdam University Press.
Karina van Dalen-Oskam (2021). Het raadsel literatuur. Is literaire kwaliteit meetbaar? Amsterdam University Press.
Maciej Eder, Saskia Lensink, Joris van Zundert, Karina van Dalen-Oskam (2022). Replicating The Riddle of Literary Quality: The litRiddle package for R, in: Digital Humanities 2022 Conference Abstracts. The University of Tokyo, Japan, 25–29 July 2022, p. 636–637 https://dh2022.dhii.asia/dh2022bookofabsts.pdf
Corina Koolen, Karina van Dalen-Oskam, Andreas van Cranenburgh, Erica Nagelhout (2020). Literary quality in the eye of the Dutch reader: The National Reader Survey. Poetics 79: 101439, https://doi.org/10.1016/j.poetic.2020.101439.
More publications from the project: see https://literaryquality.huygens.knaw.nl/?page_id=588.