This vignette provides a comparison of {r2r} with the same-purpose CRAN package {hash}, which also offers an implementation of hash tables based on R environments. We first describe the features offered by both packages, and then perform some benchmark timing comparisons. The package versions referred to in this vignette are:
library(hash)
library(r2r)
packageVersion("hash")
#> [1] '2.2.6.1'
packageVersion("r2r")
#> [1] '0.1.1'
Both {r2r} and {hash} hash tables are built on top of the built-in R environment data structure, and thus have a similar API. In particular, hash table objects have reference semantics in both packages. {r2r} hash tables are S3 class objects, whereas in {hash} the data structure is implemented as an S4 class.
Hash tables provided by r2r support keys and values of arbitrary type, arbitrary key comparison and hash functions, and have customizable behaviour (either throw an exception or return a default value) upon query of a missing key.
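As a minimal illustration of these features, the sketch below stores a value under a non-string key and configures the behaviour for missing keys (the default and on_missing_key argument names reflect our reading of the r2r API, and should be checked against the package documentation):
m <- hashmap(default = NA, on_missing_key = "default")  # return NA on missing keys
m[[ c(1, 2, 3) ]] <- "value stored under a numeric vector key"
m[[ c(1, 2, 3) ]]  # retrieves the value stored above
m[[ "absent" ]]    # NA, the configured default for missing keys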
In contrast, hash tables in hash currently support only string keys, with basic identity comparison (the hashing is performed automatically by the underlying environment objects); values can be arbitrary R objects. Querying missing keys through non-vectorized [[-subsetting returns the default value NULL, whereas queries through vectorized [-subsetting result in an error. On the other hand, hash also offers support for inverting hash tables (an experimental feature at the time of writing).
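The sketch below illustrates this behaviour; invert() is the experimental inversion helper exported by hash, and the commented line shows the vectorized query that, as described above, would raise an error:
h <- hash()
h[["a"]] <- 1
h[["absent"]]    # NULL: missing key, non-vectorized query
# h["absent"]    # missing key, vectorized query: raises an error
invert(h)        # experimental: swaps keys and values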
The table below summarizes the features of the two packages:

Feature | r2r | hash |
---|---|---|
Basic data structure | R environment | R environment |
Arbitrary type keys | X | |
Arbitrary type values | X | X |
Arbitrary hash function | X | |
Arbitrary key comparison function | X | |
Throw or return default on missing keys | X | |
Hash table inversion | | X |
We will perform our benchmark tests using the CRAN package microbenchmark.
library(microbenchmark)
We start by timing the insertion of:

N <- 1e4

random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict ourselves to string (i.e. length-one character) keys. We can generate random keys as follows:
chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
  sample(chars, n, replace = TRUE),
  sample(chars, n, replace = TRUE),
  sample(chars, n, replace = TRUE),
  sample(chars, n, replace = TRUE),
  sample(chars, n, replace = TRUE)
)

set.seed(840)
keys <- random_keys(N)
values <- rnorm(N)
We test both the non-vectorized ([[<-) and vectorized ([<-) operators:
microbenchmark(
  `r2r_[[<-` = {
    for (i in seq_along(keys))
      m_r2r[[ keys[[i]] ]] <- values[[i]]
  },
  `r2r_[<-` = { m_r2r[keys] <- values },
  `hash_[[<-` = {
    for (i in seq_along(keys))
      m_hash[[ keys[[i]] ]] <- values[[i]]
  },
  `hash_[<-` = { m_hash[keys] <- values },
  times = 30,
  setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
#> Unit: milliseconds
#>       expr       min        lq     mean   median       uq      max neval
#>   r2r_[[<- 113.56767 175.58814 214.9829 211.1825 258.1184 313.1902    30
#>    r2r_[<- 114.14440 159.72019 199.9455 189.6857 233.6445 374.7078    30
#>  hash_[[<-  88.96304 134.33801 169.9768 171.0875 196.2304 287.2871    30
#>   hash_[<-  47.83579  69.46998 111.3905 109.9926 137.5920 283.5104    30
As can be seen, r2r and hash have comparable performance for the insertion of key-value pairs, with both vectorized and non-vectorized insertions, hash being somewhat more efficient in both cases.
We now test key query, again both in non-vectorized and vectorized form:
microbenchmark(
  `r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
  `r2r_[`  = { m_r2r[ keys ] },
  `hash_[[` = { for (key in keys) m_hash[[ key ]] },
  `hash_[`  = { m_hash[ keys ] },
  times = 30,
  setup = {
    m_r2r <- hashmap(); m_r2r[keys] <- values
    m_hash <- hash(); m_hash[keys] <- values
  }
)
#> Unit: milliseconds
#>     expr       min        lq      mean    median        uq       max neval
#>   r2r_[[ 123.62104 167.21320 227.27907 229.56724 288.04601 355.13941    30
#>    r2r_[ 108.73472 135.10280 194.41796 182.84417 235.26994 377.83851    30
#>  hash_[[  14.48737  17.93966  26.40463  25.36491  29.91747  58.90447    30
#>   hash_[  79.87292 113.61985 131.27771 125.32752 145.64064 233.47484    30
For non-vectorized queries, hash is significantly faster (by one order of magnitude) than r2r. This is likely due to the fact that the [[ method dispatch is handled natively by R in hash (i.e. the default [[ method for environments is used), whereas r2r suffers the overhead of S3 method dispatch. This is confirmed by the result for vectorized queries, which is comparable for the two packages; notice that here a single (rather than N) S3 method dispatch occurs in the r2r timed expression.
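To make the last point concrete, here is a small sketch (assuming m_r2r has been filled as in the setup above) contrasting N scalar queries, each paying the S3 dispatch cost, with a single vectorized query that dispatches only once:
res_loop <- vapply(keys, function(k) m_r2r[[k]], numeric(1))  # N S3 dispatches
res_vec  <- m_r2r[keys]                                       # a single S3 dispatch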
As an additional test, we perform the benchmarks for non-vectorized expressions with a new set of keys:
set.seed(841)
new_keys <- random_keys(N)

microbenchmark(
  `r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
  `hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
  times = 30,
  setup = {
    m_r2r <- hashmap(); m_r2r[keys] <- values
    m_hash <- hash(); m_hash[keys] <- values
  }
)
#> Unit: milliseconds
#>         expr      min        lq      mean    median        uq       max neval
#>   r2r_[[_bis 88.03148 117.47402 163.23105 152.44477 194.02813 381.82736    30
#>  hash_[[_bis 13.88594  14.80847  21.30825  20.73479  26.69868  33.17278    30
The results are similar to the ones commented on above. Next, we test the performance of the two packages in checking for the existence of keys (notice that here has_key refers to r2r::has_key, whereas has.key is hash::has.key):
set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)

microbenchmark(
  r2r_has_key  = { for (key in mixed_keys) has_key(m_r2r, key) },
  hash_has_key = { for (key in mixed_keys) has.key(key, m_hash) },
  times = 30,
  setup = {
    m_r2r <- hashmap(); m_r2r[keys] <- values
    m_hash <- hash(); m_hash[keys] <- values
  }
)
#> Unit: milliseconds
#>          expr       min       lq     mean   median       uq      max neval
#>   r2r_has_key  85.21853 100.4766 123.3598 117.2267 140.8682 205.6607    30
#>  hash_has_key 222.51019 277.8647 323.5463 323.3688 371.1581 429.5249    30
The results are of the same order of magnitude for the two packages, with r2r being roughly two to three times faster in this particular case.
Finally, we test key deletion. In order to handle name collisions, we will use delete() (which refers to r2r::delete()) and del() (which refers to hash::del()).
microbenchmark(
  r2r_delete  = { for (key in keys) delete(m_r2r, key) },
  hash_delete = { for (key in keys) del(key, m_hash) },
  hash_vectorized_delete = { del(keys, m_hash) },
  times = 30,
  setup = {
    m_r2r <- hashmap(); m_r2r[keys] <- values
    m_hash <- hash(); m_hash[keys] <- values
  }
)
#> Unit: milliseconds
#>                    expr        min         lq       mean     median         uq        max neval
#>              r2r_delete 396.948296 482.664244 594.748545 610.982801 675.851870 921.373185    30
#>             hash_delete 219.420023 258.364174 319.246659 304.905175 363.718673 511.815017    30
#>  hash_vectorized_delete   3.995802   4.575463   5.519199   5.021491   5.987731   8.943831    30
The vectorized version of hash significantly outperforms the non-vectorized versions (by roughly two orders of magnitude in speed). Currently, r2r does not support vectorized key deletion [1].
The two R packages r2r and hash offer hash table implementations with different advantages and drawbacks. r2r focuses on flexibility, and has a richer set of features; hash is more minimal, but offers superior performance in some important tasks. Finally, as a positive note for both parties, the two packages share a similar API, making it relatively easy to switch between the two according to the needs of a particular use case, as sketched below.
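The following side-by-side sketch, which uses only operations already shown in this vignette, illustrates how closely the two APIs mirror each other:
m_r2r <- hashmap();            m_hash <- hash()
m_r2r[c("a", "b")] <- 1:2;     m_hash[c("a", "b")] <- 1:2   # vectorized insertion
m_r2r[["a"]];                  m_hash[["a"]]                # non-vectorized query
has_key(m_r2r, "b");           has.key("b", m_hash)         # key existence check
delete(m_r2r, "a");            del("a", m_hash)             # key deletion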
[1] This is due to complications introduced by the internal hash collision handling system of r2r.