Leverages dplyr
to process the calculations of a plot
inside a database. This package provides helper functions that abstract
the work at three levels:
ggplot2
objectdata.frame
object with the
calculationsYou can install the released version from CRAN:
# install.packages("dbplot")
Or the the development version from GitHub, using the
remotes
package:
# install.packages("remotes")
# remotes::install_github("edgararuiz/dbplot")
For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com
To use Spark, please visit the sparklyr
official
website: http://spark.rstudio.com
In addition to database connections, the functions work with
sparklyr
. A local RSQLite
database will be
used for the examples in this README.
library(DBI)
library(odbc)
library(dplyr)
<- dbConnect(RSQLite::SQLite(), ":memory:")
con <- copy_to(con, nycflights13::flights, "flights") db_flights
ggplot
By default dbplot_histogram()
creates a 30 bin
histogram
library(ggplot2)
%>%
db_flights dbplot_histogram(distance)
Use binwidth
to fix the bin size
%>%
db_flights dbplot_histogram(distance, binwidth = 400)
Because it outputs a ggplot2
object, more customization
can be done
%>%
db_flights dbplot_histogram(distance, binwidth = 400) +
labs(title = "Flights - Distance traveled") +
theme_bw()
To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.
A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.
fill
argument is passed, the default calculation
will be count, n()
%>%
db_flights dbplot_raster(sched_dep_time, sched_arr_time)
%>%
db_flights dbplot_raster(
sched_dep_time,
sched_arr_time, mean(distance, na.rm = TRUE)
)
resolution
argument controls that, it defaults to 100%>%
db_flights dbplot_raster(
sched_dep_time,
sched_arr_time, mean(distance, na.rm = TRUE),
resolution = 20
)
dbplot_bar()
defaults to a tally() of each value in a
discrete variable%>%
db_flights dbplot_bar(origin)
%>%
db_flights dbplot_bar(origin, avg_delay = mean(dep_delay, na.rm = TRUE))
dbplot_line()
defaults to a tally() of each value in a
discrete variable%>%
db_flights dbplot_line(month)
%>%
db_flights dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))
It expects a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn’t calculate outliers. It has been tested with the following connections:
sparklyr
Here is an example using dbplot_boxplot()
with a local
data frame:
::flights %>%
nycflights13dbplot_boxplot(origin, distance)
If a more customized plot is needed, the data the underpins the plots can also be accessed:
db_compute_bins()
- Returns a data frame with the bins
and count per bindb_compute_count()
- Returns a data frame with the
count per discrete valuedb_compute_raster()
- Returns a data frame with the
results per x/y intersectiondb_compute_raster2()
- Returns same as
db_compute_raster()
function plus the coordinates of the
x/y boxesdb_compute_boxplot()
- Returns a data frame with
boxplot calculations%>%
db_flights db_compute_bins(arr_delay)
#> # A tibble: 28 x 2
#> arr_delay count
#> <dbl> <int>
#> 1 NA 9430
#> 2 -86 5325
#> 3 -40.7 207999
#> 4 4.53 79784
#> 5 49.8 19063
#> 6 95.1 7890
#> 7 140. 3746
#> 8 186. 1742
#> 9 231. 921
#> 10 276. 425
#> # … with 18 more rows
The data can be piped to a plot
%>%
db_flights filter(arr_delay < 100 , arr_delay > -50) %>%
db_compute_bins(arr_delay) %>%
ggplot() +
geom_col(aes(arr_delay, count, fill = count))
db_bin()
Uses ‘rlang’ to build the formula needed to create the bins of a numeric variable in an un-evaluated fashion. This way, the formula can be then passed inside a dplyr verb.
db_bin(var)
#> (((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30) * ifelse(as.integer(floor((var -
#> min(var, na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var,
#> na.rm = TRUE))/30))) == 30, as.integer(floor((var - min(var,
#> na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30))) -
#> 1, as.integer(floor((var - min(var, na.rm = TRUE))/((max(var,
#> na.rm = TRUE) - min(var, na.rm = TRUE))/30))))) + min(var,
#> na.rm = TRUE)
%>%
db_flights group_by(x = !! db_bin(arr_delay)) %>%
tally()
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.29.0 [:memory:]
#> x n
#> <dbl> <int>
#> 1 NA 9430
#> 2 -86 5325
#> 3 -40.7 207999
#> 4 4.53 79784
#> 5 49.8 19063
#> 6 95.1 7890
#> 7 140. 3746
#> 8 186. 1742
#> 9 231. 921
#> 10 276. 425
#> # … with more rows
%>%
db_flights filter(!is.na(arr_delay)) %>%
group_by(x = !! db_bin(arr_delay)) %>%
tally()%>%
%>%
collect ggplot() +
geom_col(aes(x, n))
dbDisconnect(con)