utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R’s UTF-8 handling.
utf8 is available on CRAN. To install the latest released version, run the following command in R:
install.packages("utf8")
To install the latest development version, run the following:
devtools::install_github("patperry/r-utf8")
library(utf8)
Use as_utf8()
to validate input text and convert to UTF-8 encoding. The function
alerts you if the input text has the wrong declared encoding:
# second entry is encoded in latin-1, but declared as UTF-8 x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile") Encoding(x) <- c("UTF-8", "UTF-8", "bytes") as_utf8(x) # fails #> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4 # mark the correct encoding Encoding(x[2]) <- "latin1" as_utf8(x) # succeeds #> [1] "façile" "façile" "façile"
Use utf8_normalize()
to convert to Unicode composed normal form (NFC). Optionally apply
compatibility maps for NFKC normal form or case-fold.
# three ways to encode an angstrom character (angstrom <- c("\u00c5", "\u0041\u030a", "\u212b")) #> [1] "Å" "Å" "Å" utf8_normalize(angstrom) == "\u00c5" #> [1] TRUE TRUE TRUE # perform full Unicode case-folding utf8_normalize("Größe", map_case = TRUE) #> [1] "grösse" # apply compatibility maps to NFKC normal form # (example from https://twitter.com/aprilarcus/status/367557195186970624) utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.", map_compat = TRUE) #> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."
On some platforms (including MacOS), the R implementation of print()
uses
an outdated version of the Unicode standard to determine which
characters are printable. Use utf8_print()
for an updated print function:
print(intToUtf8(0x1F600 + 0:79)) # with default R print function #> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏" utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line #> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫…" utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit #> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"
Cite utf8 with the following BibTeX entry:
@Manual{,
title = {utf8: Unicode Text Processing},
author = {Patrick O. Perry},
year = {2018},
note = {R package version 1.1.4},
url = {https://github.com/patperry/r-utf8},
}
The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you’d like to contribute, either
fork the repository and submit a pull request
or contact the maintainer via e-mail.
This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.