CRAN Task View: Natural Language Processing

Fridolin Wild

Maintainer:	Fridolin Wild
Contact:	wild at open.ac.uk
Version:	2023-09-12
URL:	https://CRAN.R-project.org/view=NaturalLanguageProcessing
Source:	https://github.com/cran-task-views/NaturalLanguageProcessing/
Contributions:	Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide.
Citation:	Fridolin Wild (2023). CRAN Task View: Natural Language Processing. Version 2023-09-12. URL https://CRAN.R-project.org/view=NaturalLanguageProcessing.
Installation:	The packages from this task view can be installed automatically using the ctv package. For example, `ctv::install.views("NaturalLanguageProcessing", coreOnly = TRUE)` installs all the core packages or `ctv::update.views("NaturalLanguageProcessing")` installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details.

Natural language processing has come a long way since its foundations were laid in the 1940s and 1950s (for an introduction see, e.g., Jurafsky and Martin (2008, 2009, 2022 draft third edition): Speech and Language Processing, Pearson Prentice Hall). This CRAN task view collects relevant R packages that support computational linguists in conducting analysis of speech and language on a variety of levels, focusing on words, syntax, semantics, and pragmatics.

In recent years, we have elaborated a framework to be used in packages dealing with the processing of written material: the package tm. Extension packages in this area are highly recommended to interface with tm’s basic routines and useRs are cordially invited to join in the discussion on further developments of this framework package.

A basic introduction with comprehensive examples is provided in the book by Fridolin Wild (2016): Learning Analytics in R, Springer.

Frameworks:

tm provides a comprehensive text mining framework for R. The Journal of Statistical Software article Text Mining Infrastructure in R gives a detailed overview and presents techniques for count-based analysis methods, text clustering, text classification and string kernels.
tm.plugin.dc allows for distributing corpora across storage devices (local files or Hadoop Distributed File System).
tm.plugin.mail helps with importing mail messages from archive files such as used in Thunderbird (mbox, eml).
tm.plugin.alceste allows importing text corpora written in a file in the Alceste format.
RcmdrPlugin.temis is an Rcommander plug-in providing an integrated solution to perform a series of text mining tasks such as importing and cleaning a corpus, and analyses like term and document counts, vocabulary tables, term co-occurrences and document similarity measures, time series analysis, correspondence analysis and hierarchical clustering.
openNLP provides an R interface to OpenNLP, a collection of natural language processing tools including a sentence detector, tokenizer, POS tagger, shallow and full syntactic parser, and named-entity detector, using the Maxent Java package for training and using maximum entropy models.
Trained models for English and Spanish to be used with openNLP are available from http://datacube.wu.ac.at/ as packages openNLPmodels.en and openNLPmodels.es, respectively.
RWeka is an interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java. Especially useful in the context of natural language processing is its functionality for tokenization and stemming.
tidytext provides means for text mining for word processing and sentiment analysis using dplyr, ggplot2, and other tidy tools.
udpipe provides language-independent tokenization, part-of-speech tagging, lemmatization, dependency parsing, and training of treebank-based annotation models.
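
As a hedged illustration of the framework idea described above, the following minimal sketch builds a small in-memory corpus with tm and derives a document-term matrix from it. The three example sentences are invented for illustration, and the sketch assumes that tm is installed; it is one possible workflow, not the canonical one.

```r
# Minimal sketch, assuming the tm package is installed.
library(tm)

# Three made-up documents (illustrative data, not from the task view).
docs <- c(
  "Text mining turns text into data.",
  "Topic models summarise large text collections.",
  "Stemming reduces words to their stems."
)

# Build a volatile (in-memory) corpus from the character vector.
corpus <- VCorpus(VectorSource(docs))

# Apply common clean-up transformations; content_transformer() wraps a
# plain function so that it operates on the document content.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Count term occurrences per document.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

# Terms that occur at least twice across the whole corpus.
frequent_terms <- findFreqTerms(dtm, lowfreq = 2)
print(frequent_terms)
```

The resulting document-term matrix is the kind of object that count-based methods, clustering, and classification routines typically take as input.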

Words (lexical DBs, keyword extraction, string manipulation, stemming):

R’s base package already provides a rich set of character manipulation routines. See help.search(keyword = "character", package = "base") for more information on these capabilities.
wordnet provides an R interface to WordNet, a large lexical database of English.
RKEA provides an R interface to KEA (Version 5.0). KEA (for Keyphrase Extraction Algorithm) allows for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.
gsubfn can be used for certain parsing tasks such as extracting words from strings by content rather than by delimiters. demo("gsubfn-gries") shows an example of this in a natural language processing context.
textreuse provides a set of tools for measuring similarity among documents and helps with detecting passages which have been reused. The package implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
boilerpipeR helps with the extraction and sanitizing of text content from HTML files: removal of ads, sidebars, and headers using the boilerpipe Java library.
tau contains basic string manipulation and analysis routines needed in text processing such as dealing with character encoding, language, pattern counting, and tokenization.
SnowballC provides exactly the same API as Rstem, but uses a slightly different design of the C libstemmer library from the Snowball project. It also supports two more languages.
stringi provides R language wrappers to the International Components for Unicode (ICU) library and allows for: conversion of text encodings, string searching and collation in any locale, Unicode normalization of text, handling texts with mixed reading direction (e.g., left to right and right to left), and text boundary analysis (for tokenizing on different aggregation levels or to identify suitable line wrapping locations).
stringdist implements an approximate string matching version of R’s native ‘match’ function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), qgrams (q-gram, cosine, jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding or between integer vectors representing generic sequences.
Rstem (available from Omegahat) is an alternative interface to a C version of Porter’s word stemming algorithm.
koRpus is a diverse collection of functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type token ratio, HD-D/vocd-D, MTLD) and readability (e.g., Flesch, SMOG, LIX, Dale-Chall). See the web page for more information.
ore provides an alternative to R’s built-in functionality for handling regular expressions, based on the Onigmo Regular Expression Library. It offers first-class compiled regex objects, partial matching and function-based substitutions, amongst other features. A benchmark comparing results for ore functions with stringi and the R base implementation is available in the regex-performance project on GitHub.
languageR provides data sets and functions exemplifying statistical methods, and some facilitatory utility functions used in the book by R. H. Baayen: “Analyzing Linguistic Data: a Practical Introduction to Statistics Using R”, Cambridge University Press, 2008.
zipfR offers some statistical models for word frequency distributions. The utilities include functions for loading, manipulating and visualizing word frequency data and vocabulary growth curves. The package also implements several statistical models for the distribution of word frequencies in a population. (The name of this library derives from the most famous word frequency distribution, Zipf’s law.)
wordcloud provides a visualisation similar to the famous wordle ones: it horizontally and vertically distributes features in a pleasing visualisation with the font size scaled by frequency.
hunspell is a stemmer and spell-checker library designed for languages with rich morphology and complex word compounding or character encoding. The package can check and analyze individual words as well as search for incorrect words within a text, LaTeX or (R package) manual document.
phonics provides a collection of phonetic algorithms including Soundex, Metaphone, NYSIIS, Caverphone, and others.
tesseract is an OCR engine with unicode (UTF-8) support that can recognize over 100 languages out of the box.
mscsweblm4r provides an interface to the Microsoft Cognitive Services Web Language Model API and can be used to calculate the probability for a sequence of words to appear together, the conditional probability that a specific word will follow an existing sequence of words, get the list of words (completions) most likely to follow a given sequence of words, and insert spaces into a string of words adjoined together without any spaces (hashtags, URLs, etc.).
mscstexta4r provides an interface to the Microsoft Cognitive Services Text Analytics API and can be used to perform sentiment analysis, topic detection, language detection, and key phrase extraction.
sentencepiece is an unsupervised tokeniser producing Byte Pair Encoding (BPE), Unigram, Char, or Word models.
tokenizers helps split text into tokens, supporting shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions.
tokenizers.bpe helps split text into syllable tokens, implemented using Byte Pair Encoding and the YouTokenToMe library.
crfsuite uses Conditional Random Fields for labelling sequential data.
jiebaR provides Chinese text segmentation, keyword extraction, and part-of-speech tagging for R.
keyperm implements a novel approach to keyword analysis based on permutation tests, i.e. identification of words that are significantly more frequent in one corpus compared to another.
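
Several of the word-level operations listed above (tokenization, frequency counting, lexical diversity, shingled n-grams, edit distances) can be sketched with base R and the bundled utils package alone. The example below is a minimal illustration on an invented sentence; the dedicated packages above offer far more robust and faster implementations.

```r
# Base R only; the sample text is invented for illustration.
text <- "The cat sat on the mat. The dog sat on the log."

# Tokenise: lower-case, then split on runs of non-letter characters.
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Word frequencies, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Type-token ratio: number of distinct words divided by number of words.
ttr <- length(unique(tokens)) / length(tokens)

# Shingled word bigrams: every pair of adjacent tokens.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Levenshtein edit distance between two strings (utils::adist).
d <- drop(adist("kitten", "sitting"))

print(freq)
print(ttr)
print(bigrams)
print(d)
```

The simple ASCII character class used for splitting is an assumption that only suits plain English text; for Unicode-aware boundary analysis, a package such as stringi is the better choice.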

Semantics:

lsa provides routines for performing a latent semantic analysis with R. The basic idea of latent semantic analysis (LSA) is that texts have a higher-order (= latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using conceptual indices that are derived statistically via a truncated singular value decomposition (a two-mode factor analysis) over a given document-term matrix, this variability problem can be overcome. The article Representing and Analysing Meaning with LSA by Wild (2016) gives a detailed overview and comprehensive examples.
topicmodels provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) by David M. Blei and co-authors and the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.
BTM helps identify topics in texts from term-term co-occurrences (hence ‘biterm’ topic model, BTM).
topicdoc provides topic-specific diagnostics for LDA and CTM topic models to assist in evaluating topic quality.
lda implements Latent Dirichlet Allocation and related models similar to LSA and topicmodels.
stm (Structural Topic Model) implements a topic model variant that can include document-level metadata. The package also includes tools for model selection, visualization, and estimation of topic-covariate regressions.
kernlab allows creating and computing with string kernels, like full string, spectrum, or bounded range string kernels. It can directly use the document format used by tm as input.
golgotha (not yet on CRAN) provides a wrapper to Bidirectional Encoder Representations from Transformers (BERT) for language modelling and textual entailment in particular.
ruimtehol provides a neural network machine learning approach to vector space semantics, implementing an interface to StarSpace, providing means for classification, proximity measurement, and model management (training, predicting, several interfaces for textual entailment of varying granularity).
skmeans helps with clustering providing several algorithms for spherical k-means partitioning.
movMF provides another clustering alternative (approximations are fitted with von Mises-Fisher distributions of the unit length vectors).
sentometrics provides optimized prediction based on textual sentiment, accounting for the intrinsic challenge that sentiment can be computed and pooled across texts and time in various ways. See Ardia et al. (2021).
svs offers simple implementations of various techniques for semantic vector spaces (viz. latent semantic analysis, probabilistic latent semantic analysis, non-negative matrix factorization, EM clustering, latent class analysis, correspondence analysis, …).
textir is a suite of tools for text and sentiment mining.
textcat provides support for n-gram based text categorization.
textrank is an extension of the PageRank algorithm and allows summarizing text by calculating how sentences are related to one another.
corpora offers utility functions for the statistical analysis of corpus frequency data.
text2vec provides tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities.
word2vec allows learning vector representations of words by continuous bag of words and skip-gram implementations of the “word2vec” algorithm. The techniques are detailed in the paper Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al. (2013).
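
The truncated singular value decomposition behind latent semantic analysis, as described in the lsa entry above, can be sketched in base R. The example below is a toy illustration with invented counts and an arbitrarily chosen number of dimensions (k = 2); it does not use the lsa package and omits the weighting schemes that a real analysis would normally apply.

```r
# Base R sketch of the idea behind LSA; the counts are invented.
# Rows are terms, columns are documents.
tdm <- matrix(
  c(2, 1, 0, 0,   # car
    1, 2, 0, 0,   # automobile
    0, 0, 2, 1,   # flower
    0, 0, 1, 2),  # blossom
  nrow = 4, byrow = TRUE,
  dimnames = list(c("car", "automobile", "flower", "blossom"),
                  paste0("doc", 1:4))
)

# Singular value decomposition: tdm = U diag(d) V^T.
s <- svd(tdm)

# Truncation step: keep only the k largest singular values.
k <- 2
doc_vectors <- s$v[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
rownames(doc_vectors) <- colnames(tdm)

# Cosine similarity between two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Similarity on raw counts versus similarity in the reduced space.
raw_sim        <- cosine(tdm[, "doc1"], tdm[, "doc2"])
sim_same_topic <- cosine(doc_vectors["doc1", ], doc_vectors["doc2", ])
sim_diff_topic <- cosine(doc_vectors["doc1", ], doc_vectors["doc3", ])

print(c(raw = raw_sim, same = sim_same_topic, diff = sim_diff_topic))
```

In this toy setting, doc1 and doc2 use the two vehicle terms in different proportions, so their similarity on raw counts is below 1, while in the two-dimensional latent space they become practically identical and remain unrelated to the flower documents.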

Pragmatics:

qdap helps with quantitative discourse analysis of transcripts.
quanteda supports quantitative analysis of textual data.
sentiment.ai performs sentence-level text embedding and context-aware sentiment analysis with pre-trained sentiment models (you can also use your own models). This approach outperforms bag-of-words-based sentiment analysis and allows the user to perform clustering and other supervised learning on embedded text in addition to the built-in sentiment scoring functions. sentiment.ai also provides functionality to use cosine similarity to perform matching between matrices of embedded text. It currently supports Google’s Universal Sentence Encoder models (English or Multi-Lingual), though it can support other TensorFlow Hub embedding models. It can use GPU acceleration via NVIDIA CUDA for scalability.

Corpora:

corporaexplorer facilitates visual information retrieval over document collections, supporting filtering and corpus-level as well as document-level visualisation using interactive web apps built with Shiny.
textplot provides various methods for corpus-, document-, and sentence-level visualisation.
tm.plugin.factiva, tm.plugin.lexisnexis, tm.plugin.europresse allow importing press and Web corpora from (respectively) Dow Jones Factiva, LexisNexis, and Europresse.

CRAN packages

Core:

tm.

Regular:

boilerpipeR, BTM, corpora, corporaexplorer, crfsuite, gsubfn, hunspell, jiebaR, kernlab, keyperm, koRpus, languageR, lda, lsa, movMF, mscstexta4r, mscsweblm4r, openNLP, ore, phonics, qdap, quanteda, RcmdrPlugin.temis, RKEA, ruimtehol, RWeka, sentencepiece, sentiment.ai, sentometrics, skmeans, SnowballC, stm, stringdist, stringi, svs, tau, tesseract, text2vec, textcat, textir, textplot, textrank, textreuse, tidytext, tm.plugin.alceste, tm.plugin.dc, tm.plugin.europresse, tm.plugin.factiva, tm.plugin.lexisnexis, tm.plugin.mail, tokenizers, tokenizers.bpe, topicdoc, topicmodels, udpipe, word2vec, wordcloud, wordnet, zipfR.

Other resources

GitHub Project: golgotha
GitHub Project: regex-performance
Omegahat Package: Rstem