skimr is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors. It is opinionated in its defaults, but easy to modify.
In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
summary(iris$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
fivenum(iris$Sepal.Length)
## [1] 4.3 5.1 5.8 6.4 7.9
summary(iris$Species)
## setosa versicolor virginica
## 50 50 50
The skim() function
The core function of skimr is skim(), which is designed to work with (grouped) data frames, and will try to coerce other objects to data frames if possible. Like summary(), skim()’s method for data frames presents results for every column; the statistics it provides depend on the class of the variable.
By design, the main focus of skimr is on data frames; it is intended to fit well within a data pipeline and relies extensively on tidyverse vocabulary, which focuses on data frames.
Results of skim() are printed horizontally, with one section per variable type and one row per variable.
library(skimr)
skim(iris)
## ── Data Summary ────────────────────────
## Values
## Name iris
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## factor 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique
## 1 Species 0 1 FALSE 3
## top_counts
## 1 set: 50, ver: 50, vir: 50
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
## 2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
## 3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
## 4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
The results are returned as a single wide data frame, with some additional attributes and two metadata columns:
skim_variable: name of the original variable
skim_type: class of the variable
Unlike many other objects within R, these columns are intrinsic to the skim_df class. Dropping these variables will result in a coercion to a tibble. The is_skim_df() function is used to assert that an object is a skim_df.
skim(iris) %>% is_skim_df()
## [1] TRUE
## attr(,"message")
## character(0)
skim(iris) %>%
  dplyr::select(-skim_type, -skim_variable) %>%
  is_skim_df()
## [1] FALSE
## attr(,"message")
## [1] "Object is not a `skim_df`: missing column `skim_type`; missing column `skim_variable`"
skim(iris) %>%
  dplyr::select(-n_missing) %>%
  is_skim_df()
## [1] TRUE
## attr(,"message")
## character(0)
In order to avoid type coercion, columns for summary statistics for different types are prefixed with the corresponding skim_type. This means that the columns of the skim_df are somewhat sparse, with quite a few missing values. This is because for some statistics the representations for different types of variables are different. For example, the mean of a Date variable and of a numeric variable are represented differently when printing, but this cannot be supported in a single vector. The exceptions are n_missing and complete_rate (non-missing/number of observations), which are the same for all types of variables.
skim(iris) %>%
  tibble::as_tibble()
## # A tibble: 5 × 15
## skim_type skim_variable n_missing complete_rate factor.ordered factor.n_unique
## <chr> <chr> <int> <dbl> <lgl> <int>
## 1 factor Species 0 1 FALSE 3
## 2 numeric Sepal.Length 0 1 NA NA
## 3 numeric Sepal.Width 0 1 NA NA
## 4 numeric Petal.Length 0 1 NA NA
## 5 numeric Petal.Width 0 1 NA NA
## # … with 9 more variables: factor.top_counts <chr>, numeric.mean <dbl>,
## # numeric.sd <dbl>, numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>,
## # numeric.p75 <dbl>, numeric.p100 <dbl>, numeric.hist <chr>
This is in contrast to summary.data.frame(), which stores statistics in a table. The distinction is important, because the skim_df object is pipeable and easy to use for additional manipulation: for example, the user could select all of the variable means, or all summary statistics for a specific variable.
skim(iris) %>%
  dplyr::filter(skim_variable == "Petal.Length")
## ── Data Summary ────────────────────────
## Values
## Name iris
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## numeric 1
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
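The other manipulation mentioned above, collecting all of the variable means, can be sketched with plain column access. This is a minimal illustration, assuming the default skimmers so that the means live in the numeric.mean column; the names is_numeric and means are introduced here for the example only.

```r
library(skimr)

skimmed <- skim(iris)

# Keep the rows for numeric variables, then pair each mean with its variable name.
is_numeric <- skimmed$skim_type == "numeric"
means <- setNames(
  skimmed$numeric.mean[is_numeric],
  skimmed$skim_variable[is_numeric]
)
means
```

The result is an ordinary named numeric vector, so it can be passed to any function that does not know about skim_df objects.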
Most dplyr verbs should work as expected.
skim(iris) %>%
  dplyr::select(skim_type, skim_variable, n_missing)
## ── Data Summary ────────────────────────
## Values
## Name iris
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## factor 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────
## skim_variable n_missing
## 1 Species 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing
## 1 Sepal.Length 0
## 2 Sepal.Width 0
## 3 Petal.Length 0
## 4 Petal.Width 0
The base skimmers n_missing and complete_rate are computed for all of the columns in the data. All other type-based skimmers have a namespace, so you need to use a skim_type prefix to refer to the correct column.
skim(iris) %>%
  dplyr::select(skim_type, skim_variable, numeric.mean)
## ── Data Summary ────────────────────────
## Values
## Name iris
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## factor 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable mean
## 1 Sepal.Length 5.84
## 2 Sepal.Width 3.06
## 3 Petal.Length 3.76
## 4 Petal.Width 1.20
skim() also supports grouped data created by dplyr::group_by(). In this case, one additional column for each grouping variable is added to the skim_df object.
iris %>%
  dplyr::group_by(Species) %>%
  skim()
## ── Data Summary ────────────────────────
## Values
## Name Piped data
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## numeric 4
## ________________________
## Group variables Species
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable Species n_missing complete_rate mean sd p0 p25 p50
## 1 Sepal.Length setosa 0 1 5.01 0.352 4.3 4.8 5
## 2 Sepal.Length versicolor 0 1 5.94 0.516 4.9 5.6 5.9
## 3 Sepal.Length virginica 0 1 6.59 0.636 4.9 6.22 6.5
## 4 Sepal.Width setosa 0 1 3.43 0.379 2.3 3.2 3.4
## 5 Sepal.Width versicolor 0 1 2.77 0.314 2 2.52 2.8
## 6 Sepal.Width virginica 0 1 2.97 0.322 2.2 2.8 3
## 7 Petal.Length setosa 0 1 1.46 0.174 1 1.4 1.5
## 8 Petal.Length versicolor 0 1 4.26 0.470 3 4 4.35
## 9 Petal.Length virginica 0 1 5.55 0.552 4.5 5.1 5.55
## 10 Petal.Width setosa 0 1 0.246 0.105 0.1 0.2 0.2
## 11 Petal.Width versicolor 0 1 1.33 0.198 1 1.2 1.3
## 12 Petal.Width virginica 0 1 2.03 0.275 1.4 1.8 2
## p75 p100 hist
## 1 5.2 5.8 ▃▃▇▅▁
## 2 6.3 7 ▂▇▆▃▃
## 3 6.9 7.9 ▁▃▇▃▂
## 4 3.68 4.4 ▁▃▇▅▂
## 5 3 3.4 ▁▅▆▇▂
## 6 3.18 3.8 ▂▆▇▅▁
## 7 1.58 1.9 ▁▃▇▃▁
## 8 4.6 5.1 ▂▂▇▇▆
## 9 5.88 6.9 ▃▇▇▃▂
## 10 0.3 0.6 ▇▂▂▁▁
## 11 1.5 1.8 ▅▇▃▆▁
## 12 2.3 2.5 ▂▇▆▅▇
Individual columns from a data frame may be selected using tidyverse-style selectors.
skim(iris, Sepal.Length, Species)
## ── Data Summary ────────────────────────
## Values
## Name iris
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## factor 1
## numeric 1
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique
## 1 Species 0 1 FALSE 3
## top_counts
## 1 set: 50, ver: 50, vir: 50
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
Or with common select helpers.
skim(iris, starts_with("Sepal"))
## ── Data Summary ────────────────────────
## Values
## Name iris
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## numeric 2
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
## 2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
If an individual column is of an unsupported class, it is treated as a character variable with a warning.
In skimr v2, skim() will attempt to coerce non-data frames (such as vectors and matrices) to data frames. In most cases with vectors, the object being evaluated should be equivalent to wrapping the object in as.data.frame().
For example, the lynx data set is class ts.
skim(lynx)
## ── Data Summary ────────────────────────
## Values
## Name lynx
## Number of rows 114
## Number of columns 1
## _______________________
## Column type frequency:
## ts 1
## ________________________
## Group variables None
##
## ── Variable type: ts ───────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate start end frequency deltat mean sd
## 1 x 0 1 1821 1934 1 1 1538. 1586.
## min max median line_graph
## 1 39 6991 771 ⡈⢄⡠⢁⣀⠒⣀⠔
This is the same as skimming after coercing to a data frame; only the df_name attribute differs.
all.equal(skim(lynx), skim(as.data.frame(lynx)))
## [1] "Attributes: < Component \"df_name\": 1 string mismatch >"
skimr does not support skimming matrices directly but coerces them to data frames. Columns in the matrix become variables. This behavior is similar to summary.matrix(). Three possible ways to handle matrices with skim() parallel the three variations of the mean function for matrices.
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
Skimming the matrix produces similar results to colMeans().
colMeans(m)
## [1] 2.5 6.5 10.5
skim(m) # Similar to summary.matrix and colMeans()
## ── Data Summary ────────────────────────
## Values
## Name m
## Number of rows 4
## Number of columns 3
## _______________________
## Column type frequency:
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 V1 0 1 2.5 1.29 1 1.75 2.5 3.25 4 ▇▇▁▇▇
## 2 V2 0 1 6.5 1.29 5 5.75 6.5 7.25 8 ▇▇▁▇▇
## 3 V3 0 1 10.5 1.29 9 9.75 10.5 11.2 12 ▇▇▁▇▇
Skimming the transpose of the matrix will give row-wise results.
rowMeans(m)
## [1] 5 6 7 8
skim(t(m))
## ── Data Summary ────────────────────────
## Values
## Name t(m)
## Number of rows 3
## Number of columns 4
## _______________________
## Column type frequency:
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 V1 0 1 5 4 1 3 5 7 9 ▇▁▇▁▇
## 2 V2 0 1 6 4 2 4 6 8 10 ▇▁▇▁▇
## 3 V3 0 1 7 4 3 5 7 9 11 ▇▁▇▁▇
## 4 V4 0 1 8 4 4 6 8 10 12 ▇▁▇▁▇
Finally, call c() on the matrix to get results across all columns.
skim(c(m))
## ── Data Summary ────────────────────────
## Values
## Name c(m)
## Number of rows 12
## Number of columns 1
## _______________________
## Column type frequency:
## numeric 1
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 data 0 1 6.5 3.61 1 3.75 6.5 9.25 12 ▇▅▅▅▇
mean(m)
## [1] 6.5
skim_tee() produces the same printed version as skim() but returns the original, unmodified data frame. This allows for continued piping of the original data.
iris_setosa <- iris %>%
  skim_tee() %>%
  dplyr::filter(Species == "setosa")
## ── Data Summary ────────────────────────
## Values
## Name data
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## factor 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique
## 1 Species 0 1 FALSE 3
## top_counts
## 1 set: 50, ver: 50, vir: 50
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
## 2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
## 3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
## 4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
head(iris_setosa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Note that skim_tee() is customized differently than skim() itself. See below for more details.
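As a minimal sketch of that difference, the example below passes a custom skimming function to skim_tee() through its skim_fun argument. The skimmer median_skim and the statistic name med are hypothetical names introduced for illustration.

```r
library(skimr)

# Hypothetical custom skimmer: report only the median for numeric columns.
median_skim <- skim_with(
  numeric = sfl(med = ~ median(., na.rm = TRUE)),
  append = FALSE
)

# skim_tee() prints the summary produced by skim_fun and returns its input.
iris_copy <- skim_tee(iris, skim_fun = median_skim)

# The returned object is the original data, not a skim_df.
identical(iris_copy, iris)
```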
Reshaping the results from skim()
As noted above, skim() returns a wide data frame. This is usually the most sensible format for the majority of operations when investigating data, but the package has some other functions to help with edge cases.
First, partition() returns a named list of the wide data frames for each data type. Unlike the original data the partitioned data only has columns corresponding to the skimming functions used for this data type. These data frames are, therefore, not skim_df objects.
iris %>%
  skim() %>%
  partition()
## $factor
##
## ── Variable type: factor ───────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 Species 0 1 FALSE 3 set: 50, ver: 50, vir:…
##
## $numeric
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
## 2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
## 3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
## 4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
Alternatively, yank() selects only the subtable for a specific type. Think of it like dplyr::select on column types in the original data. Again, unsuitable columns are dropped.
iris %>%
  skim() %>%
  yank("numeric")
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
## 2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
## 3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
## 4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
to_long() returns a single long data frame with columns skim_type, skim_variable, stat and formatted. This is similar but not identical to the skim_df object in skimr v1.
iris %>%
  skim() %>%
  to_long() %>%
  head()
## # A tibble: 6 × 4
## skim_type skim_variable stat formatted
## <chr> <chr> <chr> <chr>
## 1 factor Species n_missing 0
## 2 numeric Sepal.Length n_missing 0
## 3 numeric Sepal.Width n_missing 0
## 4 numeric Petal.Length n_missing 0
## 5 numeric Petal.Width n_missing 0
## 6 factor Species complete_rate 1
Since the skim_variable and skim_type columns are a core component of the skim_df class, it’s possible to get unwanted side effects when using dplyr::select(). Instead, use focus() to select columns of the skimmed results and keep them as a skim_df; it always keeps the metadata columns.
iris %>%
  skim() %>%
  focus(n_missing, numeric.mean)
## ── Data Summary ────────────────────────
## Values
## Name Piped data
## Number of rows 150
## Number of columns 5
## _______________________
## Column type frequency:
## factor 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: factor ───────────────────────────────────────────────────────
## skim_variable n_missing
## 1 Species 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing mean
## 1 Sepal.Length 0 5.84
## 2 Sepal.Width 0 3.06
## 3 Petal.Length 0 3.76
## 4 Petal.Width 0 1.20
Rendering the results of skim()
The skim_df object is a wide data frame. The display is created by default using print.skim_df(); users can specify additional options by explicitly calling print([skim_df object], ...).
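For instance, an explicit print() call might look like the sketch below. It assumes that print.skim_df() exposes an include_summary argument for toggling the summary block; check ?print.skim_df for the options available in your installed version.

```r
library(skimr)

skimmed <- skim(iris)

# Print only the per-type tables, skipping the "Data Summary" block.
print(skimmed, include_summary = FALSE)
```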
For documents rendered by knitr, the package provides a custom knit_print method. To use it, the final line of your code chunk should have a skim_df object.
skim(Orange)
Name | Orange |
Number of rows | 35 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Tree | 0 | 1 | TRUE | 5 | 3: 7, 1: 7, 5: 7, 2: 7 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 922.14 | 491.86 | 118 | 484.0 | 1004 | 1372.0 | 1582 | ▃▇▁▇▇ |
circumference | 0 | 1 | 115.86 | 57.49 | 30 | 65.5 | 115 | 161.5 | 214 | ▇▃▇▇▅ |
The same type of rendering is available from reshaped skim_df objects, those generated by partition() and yank() in particular.
skim(Orange) %>%
yank("numeric")
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 922.14 | 491.86 | 118 | 484.0 | 1004 | 1372.0 | 1582 | ▃▇▁▇▇ |
circumference | 0 | 1 | 115.86 | 57.49 | 30 | 65.5 | 115 | 161.5 | 214 | ▇▃▇▇▅ |
Although it’s not a common use case outside of writing vignettes about skimr, you can fall back to default printing methods by adding the chunk option render = knitr::normal_print.
You can also disable the skimr summary by setting the chunk option skimr_include_summary = FALSE.
You can change the number of digits shown in the columns of generated statistics by changing the skimr_digits chunk option.
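These options can be set on individual chunks. As a sketch, and assuming the skimr options behave like any other knitr chunk option, they could also be applied to every chunk from a setup chunk with knitr::opts_chunk$set(); the value 3 is an arbitrary example.

```r
# In a setup chunk near the top of the R Markdown document.
knitr::opts_chunk$set(
  skimr_include_summary = FALSE, # omit the summary table
  skimr_digits = 3               # arbitrary example value
)
```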
Customizing skim()
skimr is opinionated in its choice of defaults, but users can easily add, replace, or remove the statistics for a class. For interactive use, you can create your own skimming function with the skim_with() factory. skimr also has an API for extensions in other packages. Working with that is covered later.
To add a statistic for a data type, create an sfl() (a skimr function list) for each class that you want to change:
my_skim <- skim_with(numeric = sfl(new_mad = mad))
my_skim(faithful)
Name | faithful |
Number of rows | 272 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | new_mad |
---|---|---|---|---|---|---|---|---|---|---|---|
eruptions | 0 | 1 | 3.49 | 1.14 | 1.6 | 2.16 | 4 | 4.45 | 5.1 | ▇▂▂▇▇ | 0.95 |
waiting | 0 | 1 | 70.90 | 13.59 | 43.0 | 58.00 | 76 | 82.00 | 96.0 | ▃▃▂▇▂ | 11.86 |
As the previous example suggests, the default is to append new summary statistics to the preexisting set. This behavior isn’t always desirable, especially when you want lots of changes. To stop appending, set append = FALSE.
my_skim <- skim_with(numeric = sfl(new_mad = mad), append = FALSE)
my_skim(faithful)
Name | faithful |
Number of rows | 272 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | new_mad |
---|---|---|---|
eruptions | 0 | 1 | 0.95 |
waiting | 0 | 1 | 11.86 |
You can also use skim_with() to remove specific statistics by setting them to NULL. This is commonly used to disable the inline histograms and spark graphs.
no_hist <- skim_with(ts = sfl(line_graph = NULL))
no_hist(Nile)
Name | Nile |
Number of rows | 100 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
ts | 1 |
________________________ | |
Group variables | None |
Variable type: ts
skim_variable | n_missing | complete_rate | start | end | frequency | deltat | mean | sd | min | max | median |
---|---|---|---|---|---|---|---|---|---|---|---|
x | 0 | 1 | 1871 | 1970 | 1 | 1 | 919.35 | 169.23 | 456 | 1370 | 893.5 |
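The same approach should work for the inline histogram on numeric columns. The sketch below assumes the default numeric skimmer is named hist, as the column headers in the earlier output suggest; no_hist_numeric is a name made up for this example.

```r
library(skimr)

# Drop the inline histogram from the numeric skimmers.
no_hist_numeric <- skim_with(numeric = sfl(hist = NULL))
skimmed <- no_hist_numeric(iris)

# Check which numeric statistics are still present in the result.
"numeric.hist" %in% names(skimmed)
"numeric.mean" %in% names(skimmed)
```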
The same pattern applies to changing skimmers for multiple classes simultaneously. If you want to partially apply function arguments, use the tidyverse lambda syntax.
my_skim <- skim_with(
  numeric = sfl(total = ~ sum(., na.rm = TRUE)),
  factor = sfl(missing = ~ sum(is.na(.))),
  append = FALSE
)
my_skim(iris)
Name | iris |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | missing |
---|---|---|---|
Species | 0 | 1 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | total |
---|---|---|---|
Sepal.Length | 0 | 1 | 876.5 |
Sepal.Width | 0 | 1 | 458.6 |
Petal.Length | 0 | 1 | 563.7 |
Petal.Width | 0 | 1 | 179.9 |
To modify the “base” skimmers, refer to them in a similar manner. Because base skimmers are usually a small group and must return the same type for all data types in R, append doesn’t apply here.
my_skim <- skim_with(base = sfl(length = length))
my_skim(faithful)
Name | faithful |
Number of rows | 272 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | length | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|
eruptions | 272 | 3.49 | 1.14 | 1.6 | 2.16 | 4 | 4.45 | 5.1 | ▇▂▂▇▇ |
waiting | 272 | 70.90 | 13.59 | 43.0 | 58.00 | 76 | 82.00 | 96.0 | ▃▃▂▇▂ |
Extending skimr
Packages may wish to export their own skim() functions. Use skim_with() for this. In fact, this is how skimr generates its version of skim().
#' @export
my_package_skim <- skim_with()
Alternatively, defaults for other data types can be added to skimr with the get_skimmers generic. The method for your data type should return an sfl(). Unlike the sfl() used interactively, you also need to set the skim_type argument. It should match the method type in the function signature.
get_skimmers.my_type <- function(column) {
  sfl(
    skim_type = "my_type",
    total = sum
  )
}
my_data <- data.frame(
  my_type = structure(1:3, class = c("my_type", "integer"))
)
skim(my_data)
Name | my_data |
Number of rows | 3 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
my_type | 1 |
________________________ | |
Group variables | None |
Variable type: my_type
skim_variable | n_missing | complete_rate | total |
---|---|---|---|
my_type | 0 | 1 | 6 |
An extended example is available in the vignette Supporting additional objects.
The details of rendering are dependent on the operating system R is running on, the locale of the installation, and the fonts installed. Rendering may also differ based on whether it occurs in the console or when knitting to specific types of documents such as HTML and PDF.
The most commonly reported problems involve rendering the spark graphs (inline histogram and line chart) on Windows. One common fix is to switch your locale. The function fix_windows_histograms() does this for you.
To render the spark graphs in HTML or PDF, you may need to change to a font that supports block or Braille characters (depending on which you need). Please review the separate vignette and associated template for details.