The sprintr
package contains the implementations of a computationally efficient method, called sprinter, to fit large interaction models based on the reluctant interaction selection principle. The details of the method can be found in Yu, Bien, and Tibshirani (2019) Reluctant interaction modeling. In particular, sprinter
is a multi-stage method that fits the following pairwise interaction model: \[
y = \sum_{j = 1}^p X_j \beta^\ast_j + \sum_{\ell \leq k} X_{\ell} X_k \gamma^\ast_{\ell k} + \varepsilon.
\] This document serves as an introduction of using the package with a simple simulated data example.
We consider the following simple simulation setting, where \(X \sim N(\mathbf{0}, \mathbf{I}_p)\). There are two non-trivial main effects \(\beta_1 = 1\), \(\beta_2 = -2\), and \(\beta_j = 0\) for \(j > 2\). The two important interactions are \(X_1 * X_3\) with \(\gamma_{13} = 3\), and \(X_4 * X_5\) with \(\gamma_{45} = -4\). With \(\varepsilon \sim N(0, 1)\), the following code simulates \(n = 100\) observation from the model above with \(p = 200\).
sprinter
functionThe function sprinter
implements the sprinter method (please note that the function name sprinter
is different from the package name sprintr
), which involves the following three main steps:
square = FALSE
) or with both main effects and squared effects \((X, X^2)\) (if square = TRUE
).num_keep
.lambda
, fit a lasso of the response on main effects, squared effects (if square = TRUE
), and selected interactions from the previous step.There are two tuning parameters: num_keep
(used in Step 2) and lambda
(used in Step 3). If num_keep
is not specified, it will then be set to \(n / \lceil \log n \rceil\) (see, e.g., Fan & Lv (2008)). If lambda
is not specified, then sprinter
would compute its own path of tuning parameter values based on the number of tuning parameters (nlam
) and the range of the path (lam_min_ratio
).
The output of sprinter
is a S3
object including several useful components. In particular, it involves a matrix idx
that represents the index pairs of all variables considered in Step 3:
mod$idx[(p + 1) : nrow(mod$idx), ]
#> index_1 index_2
#> [1,] 96 140
#> [2,] 5 79
#> [3,] 7 113
#> [4,] 113 155
#> [5,] 94 148
#> [6,] 5 173
#> [7,] 108 144
#> [8,] 7 165
#> [9,] 17 77
#> [10,] 158 175
#> [11,] 4 102
#> [12,] 58 108
#> [13,] 5 97
#> [14,] 30 168
#> [15,] 5 182
#> [16,] 5 25
#> [17,] 135 168
#> [18,] 115 175
#> [19,] 3 177
#> [20,] 4 96
#> [21,] 1 3
#> [22,] 4 5
Since Step 3 of sprinter
always includes the main effects, mod$idx[(p + 1): nrow(mod$idx), ]
contains the indices of all the selected interactions from Step 2. The two columns of this output represents the index pair \((\ell, k)\) of a selected interaction \(X_\ell * X_k\), where \(\ell \leq k\). Note that here the last two rows are the true interactions \(X_1 * X_3\) and \(X_4 * X_5\). If the first entry of an index pair is zero, i.e., \((\ell = 0, k)\), then it represents a main effect \(X_k\).
The output mod$coef
is a nrow(mod$idx)
-by-length(mod$lambda)
matrix. Each column of mod$coef
is a vector of estimate of all variable coefficients considered in Step 3 corresponding to one value of the lasso tuning parameter lambda
. For example, for the 30-th tuning parameter, we have the corresponding coefficient estiamte:
cv.sprinter
The function cv.sprinter()
performs cross-validation to select the value of lasso tuning parameter lambda
used in Step 3, while holding the value of num_keep
fixed.
The output of cv.sprinter
is a S3
object. The most intersting information is mod_cv$compact
, which is a matrix of three columns. The first two columns show the index pairs of all variables finally selected by the lasso in Step 3, and the last column is the coefficient estimate corresponding to that selected variable.
mod_cv$compact
#> index_1 index_2 coefficient
#> [1,] 0 1 0.988300175
#> [2,] 0 2 -1.454164327
#> [3,] 0 4 -0.217993186
#> [4,] 0 36 0.070174151
#> [5,] 0 59 -0.028997635
#> [6,] 0 66 -0.076081341
#> [7,] 0 87 -0.015292763
#> [8,] 0 94 0.071703217
#> [9,] 0 112 0.032026436
#> [10,] 0 123 -0.028004284
#> [11,] 0 157 -0.167206693
#> [12,] 158 175 0.025534472
#> [13,] 141 168 -0.006547523
#> [14,] 3 5 -0.155987560
#> [15,] 95 194 0.049738621
#> [16,] 5 97 0.091145031
#> [17,] 4 29 -0.050334021
#> [18,] 17 77 0.023256339
#> [19,] 30 168 -0.052412220
#> [20,] 115 175 -0.027360353
#> [21,] 1 3 2.471482722
#> [22,] 4 5 -3.583816424
We see (from the first two rows and the last two rows) that the fit selected by cross-validation includes all the four important variables in the model, with relatively accurate estimates of their coefficients.
Finally, there is a predict
function for the S3
object returned by cv.sprinter
that computes the prediction for a new data matrix of main effects: