The keyToEnglish
package provides a function to create
an easy-to-remember hash of a field by using a traditional hashing
function, and then mapping substrings of the hash to a list of
words.
Normally, you should install it from CRAN with
install.packages('keyToEnglish')
The latest version on GitHub can be installed via
install.packages('devtools')
devtools::install_github("mcandocia/keyToEnglish")
Note: this should ideally be done from a new R session.
The simplest way to use the package is by using the default word
list, wl_common
, and having fields as follows:
keyToEnglish(1:5)
# [1] "ProcedureCombAdmitCountryVoice" "ParentPericarpNotionPompousTreat"
# [3] "BuckSlackenReflectPublicationDeaden" "AssociatedAldehydePastDisgraceOppressive"
# [5] "CrimsonHeelParasiteWritBenefit"
You can also provide your own word list to the word_list
parameter of keyToEnglish()
.
If you provide a list of lists, you can create a phrase or sentence
with a specific structure. wml_long_sentence
is provided,
which can create sentences such as
Why would anyone need this? The main reason I could think of is being able to remember a string or other value that has been anonymized while reducing the chance of a collision.
There are also additional functions for building word lists in the
package, but I haven’t documented/exported most of them, apart from
corpora_to_word_list()
, which takes a list of files and
builds a word list by reading them. It can also parse JSON, too, which
can be useful for API-retrieved data.
# example loading files from a directory
my_word_list = corpora_to_word_list(
dir('/some/directory/with/text/files'),
max_size=16^4
)
# example loading JSON files downloaded from Wikipedia
my_word_list = corpora_to_word_list(
dir('/some/directory/with/json/files'),
max_size=16^4,
json_path=c('query','export','*')
)
By default, there are 5 phrases with 16^3 combinations, which is
about 1.15 * 10^18
. This should be somewhat safe up until
100 million values, where it becomes more likely you will see
collisions. If you increase it to 8, then it is safe to about 13
trillion values.
keyToEnglish()
Max Candocia
GPL (>= 2)
Word lists from Michael Wehar’s repository are under public domain.