A gentle introduction to group sequential design

Introduction

This article is intended to give a gentle mathematical and statistical introduction to group sequential design. We also provide relatively simple examples from the literature to explain clinical applications. The programming is not discussed in detail, but all required code is available in the source for the article, with substantial commenting in the hope that users can understand how to implement the concepts developed here. Hopefully, the few mathematical and statistical concepts introduced will not discourage those wishing to understand some underlying concepts for group sequential design.

A group sequential design allows an endpoint in a clinical trial to be analyzed repeatedly so that the trial can be stopped early for a positive result, for futility, or for a safety issue. This approach can

limit exposure risk to patients and clinical trial investment past the time where known unacceptable safety risks have been established for the endpoint of interest,
limit investment in a trial where interim results suggest further evaluation for a positive efficacy finding is futile, or
accelerate the availability of a highly effective treatment by enabling early approval following an early positive finding.

Examples of outcomes that might be considered include:

a continuous outcome such as change from baseline at some fixed follow-up time in the HAM-D depression score,
absolute difference or risk ratio for a response rate (e.g., in oncology) or failure rate for a binary (yes/no) outcome, and
a hazard ratio for a time-to-event outcome such as time-to-death or disease progression in an oncology trial or time until a cardiovascular event (death, myocardial infarction or unstable angina).

Examples of the above include:

a new treatment for major depression where an interim analysis of a continuous outcome stopped the trial for futility (Binneman et al. (2008)),
a new treatment for patients with unstable angina undergoing balloon angioplasty with a positive interim finding for a binary outcome of death, myocardial infarction or urgent repeat intervention within 30 days (The CAPTURE Investigators (1997)), and
a new treatment for patients with lung cancer based on a positive interim finding for time-to-death (Gandhi et al. (2018)).

Group sequential design framework

We assume

A two-arm clinical trial with a control and experimental group.
There are \(k\) analyses planned for some integer \(k> 1.\)
There is a natural parameter \(\delta\) describing the underlying treatment difference, with an asymptotically normal and efficient estimate \(\hat\delta_j\) that has variance \(\sigma_j^2\) and corresponding statistical information \(\mathcal{I}_j=1/\sigma_j^2\) at analysis \(j=1,2,\ldots,k\). A positive value favors experimental treatment and a negative value favors control. We assume a consistent estimate \(\hat\sigma_j^2\) of \(\sigma_j^2, j=1,2,\ldots,k\).
The information fraction is defined as \(t_j=\mathcal{I}_j/\mathcal{I}_k\) at analysis \(j=1,\ldots,k\).
Correlations between estimates at different analyses are \(\text{Corr}(\hat\delta_i,\hat\delta_j)=\sqrt{\mathcal{I}_i/\mathcal{I}_j}=\sqrt{t_i/t_j}\) for \(1\le i\le j\le k.\)
There is a test statistic \(Z_j\approx\hat\delta_j/\hat{\sigma}_j.\)

For a time-to-event outcome, \(\delta\) would typically represent the logarithm of the hazard ratio for the control group versus the experimental group. For a difference in response rates, \(\delta\) would represent the underlying difference in response rates. For a continuous outcome such as the HAM-D, we would examine the difference in change from baseline at a milestone time point (e.g., at 6 weeks as in Binneman et al. (2008)). For \(j=1,\ldots,k\), the tests \(Z_j\) are asymptotically multivariate normal with correlations as above, and for \(1\le i\le j\le k\) have \(\text{Cov}(Z_i,Z_j)=\text{Corr}(\hat\delta_i,\hat\delta_j)\) and \(E(Z_j)=\delta\sqrt{\mathcal{I}_j}.\)

This multivariate asymptotic normal distribution for \(Z_1,\ldots,Z_k\) is referred to as the canonical form by Jennison and Turnbull (2000) who have also summarized much of the surrounding literature.
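The quantities defining the canonical form can be computed directly from the statistical information at each analysis. The following base R sketch uses hypothetical information values and a hypothetical effect size (`info`, `delta`, and all object names are illustrative assumptions, not taken from any trial) to show the information fractions, the correlation matrix of the test statistics, and their means.

```r
# Hypothetical statistical information at k = 3 analyses (illustrative values)
info <- c(10, 20, 40)
k <- length(info)

# Information fractions t_j = I_j / I_k
t_frac <- info / info[k]

# Corr(Z_i, Z_j) = sqrt(I_i / I_j) for i <= j; pmin/pmax covers both orderings
corr_z <- outer(info, info, function(a, b) sqrt(pmin(a, b) / pmax(a, b)))

# Means E(Z_j) = delta * sqrt(I_j) under a hypothetical treatment effect
delta <- 0.5
mean_z <- delta * sqrt(info)

print(t_frac)
print(round(corr_z, 3))
print(round(mean_z, 3))
```

Because the correlation depends only on the ratio of information values, the same matrix is obtained from the information fractions alone.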

Bounds for testing

One-sided testing

We assume that the primary test is of the null hypothesis \(H_{0}\): \(\delta=0\) against the alternative \(H_{1}\): \(\delta = \delta_1\) for a fixed effect size \(\delta_1 > 0\), which represents a benefit of experimental treatment compared to control. We assume further that there is interest in stopping early if there is good evidence to reject one hypothesis in favor of the other. For \(i=1,2,\ldots,k-1\), interim cutoffs \(l_{i}< u_{i}\) are set; final cutoffs \(l_{k}\leq u_{k}\) are also set. For \(i=1,2,\ldots,k\), the trial is stopped at analysis \(i\) to reject \(H_{0}\) if \(l_{j}<Z_{j}< u_{j}\), \(j=1,2,\dots,i-1\) and \(Z_{i}\geq u_{i}\). If the trial continues until stage \(i\), \(H_{0}\) is not rejected at stage \(i\), and \(Z_{i}\leq l_{i}\), then \(H_{1}\) is rejected in favor of \(H_{0}\), \(i=1,2,\ldots,k\). Thus, \(3k\) parameters define a group sequential design: \(l_{i}\), \(u_{i}\), and \(\mathcal{I}_{i}\), \(i=1,2,\ldots,k\). Note that if \(l_{k}< u_{k}\) there is the possibility of completing the trial without rejecting \(H_{0}\) or \(H_{1}\). We will often restrict \(l_{k}= u_{k}\) so that one hypothesis is rejected.

We begin with a one-sided test. In this case there is no interest in stopping early for a lower bound and thus \(l_i= -\infty\), \(i=1,2,\ldots,k\). The probability of first crossing an upper bound at analysis \(i\), \(i=1,2,\ldots,k\), is

\[\alpha_{i}^{+}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{Z_{j}< u_{j}\}\}\]

The Type I error is the probability of ever crossing the upper bound when \(\delta=0\). The value \(\alpha^+_{i}(0)\) is commonly referred to as the amount of Type I error spent at analysis \(i\), \(1\leq i\leq k\). The total upper boundary crossing probability for a trial is denoted in this one-sided scenario by

\[\alpha^+(\delta) \equiv \sum_{i=1}^{k}\alpha^+_{i}(\delta)\]

and the total Type I error by \(\alpha^+(0)\). Assuming \(\alpha^+(0)=\alpha\) the design will be said to provide a one-sided group sequential test at level \(\alpha\).
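For \(k=2\) the first-crossing probabilities can be evaluated with a one-dimensional numerical integral, since \(Z_2\) given \(Z_1=z\) is normal with mean \(\delta\sqrt{\mathcal{I}_2}+\rho(z-\delta\sqrt{\mathcal{I}_1})\) and variance \(1-\rho^2\), where \(\rho=\sqrt{\mathcal{I}_1/\mathcal{I}_2}\). The base R sketch below is only an illustration under that assumption; the function name `upper_cross` and the bounds and information values are hypothetical.

```r
# First-crossing probabilities alpha_i^+(delta) for a one-sided design, k = 2.
# u: upper bounds (length 2); info: statistical information (length 2)
upper_cross <- function(u, info, delta = 0) {
  mu <- delta * sqrt(info)
  rho <- sqrt(info[1] / info[2])
  # Analysis 1: P(Z1 >= u1)
  a1 <- pnorm(u[1], mean = mu[1], lower.tail = FALSE)
  # Analysis 2: P(Z1 < u1, Z2 >= u2), integrating over the value of Z1
  integrand <- function(z) {
    cond_mean <- mu[2] + rho * (z - mu[1])
    dnorm(z, mean = mu[1]) *
      pnorm(u[2], mean = cond_mean, sd = sqrt(1 - rho^2), lower.tail = FALSE)
  }
  a2 <- integrate(integrand, lower = -Inf, upper = u[1], rel.tol = 1e-8)$value
  c(a1, a2)
}

u <- c(2.5, 2.0) # Hypothetical upper bounds
info <- c(1, 2)  # Hypothetical information, so that t_1 = 0.5

alpha_plus <- upper_cross(u, info, delta = 0)
print(alpha_plus)      # Type I error spent at each analysis
print(sum(alpha_plus)) # Total one-sided Type I error alpha^+(0)
```

The second element is smaller than the nominal tail probability \(1-\Phi(u_2)\) because paths that already crossed at the first analysis are excluded.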

Asymmetric two-sided testing

With both lower and upper bounds for testing and any real value \(\delta\) representing treatment effect we denote the probability of crossing the upper boundary at analysis \(i\) without previously crossing a bound by

\[\alpha_{i}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{ l_{j}<Z_{j}< u_{j}\}\},\]

\(i=1,2,\ldots,k.\) The total probability of crossing an upper bound prior to crossing a lower bound is denoted by

\[\alpha(\delta)\equiv\sum_{i=1}^{k}\alpha_{i}(\delta).\]

Next, we consider analogous notation for the lower bound. For \(i=1,2,\ldots,k\) denote the probability of crossing a lower bound at analysis \(i\) without previously crossing any bound by \[\beta_{i}(\delta)=P_{\delta}\{\{Z_{i}\leq l_{i}\}\cap_{j=1}^{i-1}\{ l_{j} <Z_{j}< u_{j}\}\}.\] The total lower boundary crossing probability in this case is written as \[\beta(\delta)= {\sum\limits_{i=1}^{k}} \beta_{i}(\delta).\]

When a design has equal final bounds (\(l_k=u_k\)), \(\beta(\delta_1)\) is the Type II error, which is equal to 1 minus the power of the design. In this case, \(\beta_i(\delta_1)\) is referred to as the \(\beta\)-spending at analysis \(i, i=1,\ldots,k\).
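As an illustration of this notation, the following base R sketch evaluates \(\alpha_i(\delta)\) and \(\beta_i(\delta)\) for a design with \(k=2\) by integrating over the interim continuation region \(l_1<Z_1<u_1\). The function name `cross_probs`, the bounds, the information values, and the effect size `delta1` are hypothetical choices made only for this example.

```r
# Upper and lower first-crossing probabilities for k = 2 analyses
cross_probs <- function(l, u, info, delta = 0) {
  mu <- delta * sqrt(info)
  rho <- sqrt(info[1] / info[2])
  s <- sqrt(1 - rho^2)
  # Integrate a conditional tail probability for Z2 over l1 < Z1 < u1
  continue_then <- function(tail_fun) {
    integrate(
      function(z) dnorm(z, mean = mu[1]) * tail_fun(mu[2] + rho * (z - mu[1])),
      lower = l[1], upper = u[1], rel.tol = 1e-8
    )$value
  }
  alpha <- c(
    pnorm(u[1], mean = mu[1], lower.tail = FALSE),
    continue_then(function(m) pnorm(u[2], mean = m, sd = s, lower.tail = FALSE))
  )
  beta <- c(
    pnorm(l[1], mean = mu[1]),
    continue_then(function(m) pnorm(l[2], mean = m, sd = s))
  )
  list(alpha = alpha, beta = beta)
}

l <- c(0, 2)     # Hypothetical lower bounds, with l_2 = u_2
u <- c(2.5, 2)   # Hypothetical upper bounds
info <- c(1, 2)  # Hypothetical statistical information
delta1 <- 2.5    # Hypothetical effect size under the alternative

under_h0 <- cross_probs(l, u, info, delta = 0)
under_h1 <- cross_probs(l, u, info, delta = delta1)

print(sum(under_h0$alpha)) # alpha(0): upper crossing probability under H0
print(sum(under_h1$beta))  # beta(delta1): Type II error
print(sum(under_h1$alpha)) # Power, equal to 1 - beta(delta1) since l_2 = u_2
```

With \(l_2=u_2\) every trial ends by crossing exactly one bound, so the upper and lower crossing probabilities sum to 1 for any \(\delta\).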

Spending function design

Type I error is most often defined with \(\alpha_i^+(0), i=1,\ldots,k\). This is referred to as non-binding Type I error since any lower bound is ignored in the calculation. This means that if a trial is continued in spite of a lower bound being crossed at an interim analysis, Type I error is still controlled at the design \(\alpha\)-level. For Phase III trials used for approvals of new treatments, non-binding Type I error calculation is generally expected by regulators.

For any given \(0<\alpha<1\) we define a non-decreasing \(\alpha\)-spending function \(f(t; \alpha)\) for \(t\geq 0\) with \(f(0; \alpha) =0\) and, for \(t\geq 1\), \(f( t; \alpha) =\alpha\). Letting \(t_0=0\), we set \(\alpha^+_j(0)\) for \(j=1,\ldots,k\) through the equation \[\alpha^+_{j}(0) = f(t_j;\alpha)-f(t_{j-1}; \alpha).\] Assuming an asymmetric lower bound, we similarly use a \(\beta\)-spending function \(g(t;\delta_1, \beta)\) to set \(\beta\)-spending at analysis \(j=1,\ldots, k\) as: \[\beta_{j}(\delta_1) = g(t_j;\delta_1, \beta) - g(t_{j-1};\delta_1, \beta).\]
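The incremental spending equation can be illustrated with any valid spending function. The base R sketch below uses a power-family spending function \(f(t;\alpha)=\alpha t^{\rho}\), chosen purely for illustration; this function, the value \(\rho=3\), the timing, and the name `sf_power` are assumptions for this example and are not the choices used in the trial discussed in this article.

```r
# Illustrative power-family spending function f(t; alpha) = alpha * t^rho,
# truncated so that f(0) = 0 and f(t) = alpha for t >= 1
sf_power <- function(t, alpha, rho = 3) alpha * pmin(pmax(t, 0), 1)^rho

alpha <- 0.025            # Hypothetical one-sided Type I error
t_frac <- c(0.5, 0.75, 1) # Hypothetical information fractions

# Cumulative spending f(t_j; alpha)
cum_spend <- sf_power(t_frac, alpha)

# Incremental spending f(t_j; alpha) - f(t_{j-1}; alpha), with t_0 = 0
incr_spend <- diff(c(0, cum_spend))

print(cum_spend)
print(incr_spend)
```

The increments always add up to \(\alpha\) when the final information fraction is 1, whatever the timing of the interim analyses.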

In the following example, the function \(\Phi()\) represents the cumulative distribution function of the standard normal distribution (i.e., mean 0, standard deviation 1). The major depression study of Binneman et al. (2008) considered above used the Lan and DeMets (1983) spending function approximating an O’Brien-Fleming bound for a single interim analysis halfway through the trial with

\[f(t; \alpha) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(1-\alpha/2)}{\sqrt{t}}\right) \right),\]

\[g(t; \delta_1, \beta) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(1-\beta/2)}{\sqrt{t}}\right) \right).\]
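The Lan-DeMets spending function approximating O'Brien-Fleming bounds is easy to evaluate in base R. The sketch below uses the upper quantile \(\Phi^{-1}(1-\alpha/2)\), so that all of the error is spent at \(t=1\). The function name `sf_ldof` is a hypothetical helper written for this illustration and is not the package implementation; the error rates and information fractions are those of the example in this section.

```r
# Lan-DeMets O'Brien-Fleming-type spending function; total is alpha or beta
sf_ldof <- function(t, total) {
  t <- pmin(pmax(t, 0), 1) # No spending before t = 0; all spent by t = 1
  2 * pnorm(qnorm(1 - total / 2) / sqrt(t), lower.tail = FALSE)
}

alpha <- 0.1 # One-sided Type I error in the example
beta <- 0.17 # Targeted Type II error in the example

# Spending at the planned interim analysis (t = 0.5)
alpha_planned <- sf_ldof(0.5, alpha)
beta_planned <- sf_ldof(0.5, beta)

# Spending at the information fraction 59 / 134 from the example
t_actual <- 59 / 134
alpha_actual <- sf_ldof(t_actual, alpha)
beta_actual <- sf_ldof(t_actual, beta)

print(round(c(alpha_planned, beta_planned), 4))
print(round(c(alpha_actual, beta_actual), 4))
```

These values should agree, up to rounding and any small adjustment of the Type II error made when sample sizes are rounded to integers, with the spending reported for the example.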

library(gsDesign)

delta1 <- 3 # Treatment effect, alternate hypothesis
delta0 <- 0 # Treatment effect, null hypothesis
ratio <- 1 # Randomization ratio (experimental / control)
sd <- 7.5 # Standard deviation for change in HAM-D score
alpha <- 0.1 # 1-sided Type I error
beta <- 0.17 # Targeted Type II error (1 - targeted power)
k <- 2 # Number of planned analyses
test.type <- 4 # Asymmetric bound design with non-binding futility bound
timing <- .5 # information fraction at interim analyses
sfu <- sfLDOF # O'Brien-Fleming spending function for alpha-spending
sfupar <- 0 # Parameter for upper spending function
sfl <- sfLDOF # O'Brien-Fleming spending function for beta-spending
sflpar <- 0 # Parameter for lower spending function
delta <- 0 # Not used since n.fix is provided to gsDesign()
endpoint <- "normal" # Continuous, normally distributed endpoint

# Derive normal fixed design sample size
n <- nNormal(
  delta1 = delta1,
  delta0 = delta0,
  ratio = ratio,
  sd = sd,
  alpha = alpha,
  beta = beta
)

# Derive group sequential design based on parameters above
x <- gsDesign(
  k = k,
  test.type = test.type,
  alpha = alpha,
  beta = beta,
  timing = timing,
  sfu = sfu,
  sfupar = sfupar,
  sfl = sfl,
  sflpar = sflpar,
  delta = delta, # Not used since n.fix is provided
  delta1 = delta1,
  delta0 = delta0,
  endpoint = "normal",
  n.fix = n
)
# Convert sample size at each analysis to integer values
x <- toInteger(x)
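As a rough check on the fixed-design sample size supplied as `n.fix`, the textbook formula for comparing two normal means can be evaluated by hand. This base R sketch assumes that standard formula for the total sample size with a common standard deviation; it is not taken from the package source, and the name `n_fixed` is hypothetical.

```r
delta1 <- 3  # Treatment effect under the alternative hypothesis
sd <- 7.5    # Standard deviation for change in HAM-D score
alpha <- 0.1 # One-sided Type I error
beta <- 0.17 # Type II error
ratio <- 1   # Randomization ratio (experimental / control)

# Total sample size across both arms for a one-sided test:
# N = (1 + r)^2 / r * (sd * (z_{1 - alpha} + z_{1 - beta}) / delta1)^2
z_sum <- qnorm(1 - alpha) + qnorm(1 - beta)
n_fixed <- (1 + ratio)^2 / ratio * (sd * z_sum / delta1)^2

print(n_fixed)          # Unrounded total sample size
print(ceiling(n_fixed)) # Rounded up to a whole number of patients
```

The group sequential design requires a somewhat larger maximum sample size than this fixed design in order to preserve power after allowing for interim analyses.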

The planned design used one-sided \(\alpha=0.1\) and Type II error of 17% (83% power) with an interim analysis at 50% of the final planned observations. This leads to Type I \(\alpha\)-spending of 0.02 and \(\beta\)-spending of 0.052 at the planned interim. An advantage of the spending function approach is that bounds can be adjusted when the number of observations at an analysis differs from the plan. The actual number of observations in the experimental and control groups combined at the interim analysis was 59 as opposed to the planned 67, which resulted in an interim information fraction of \(t_1=\) 0.4403. With the Lan-DeMets spending function approximating O’Brien-Fleming bounds, this results in \(\alpha\)-spending of 0.0132 (P(Cross) if delta=0 row in Efficacy column) and \(\beta\)-spending of 0.0386 (P(Cross) if delta=3 row in Futility column).

We note that the Z-values and 1-sided p-values in the table below correspond exactly, and either can be used to evaluate statistical significance for a trial result. The rows labeled ~delta at bound approximate the treatment difference required to cross a bound; these should not be used for a formal evaluation of whether a bound has been crossed. The O’Brien-Fleming spending function is generally felt to provide conservative bounds for stopping at an interim analysis, and most of the error spending is reserved for the final analysis in this example. The futility bound only required a small trend in the wrong direction to stop the trial: the observed nominal p-value of 0.77 was greater than the futility p-value bound of 0.59, so the futility bound was crossed and the trial was stopped.

Finally, we note that at the final analysis, the cumulative probability for P(Cross) if delta=0 is less than the planned \(\alpha=0.10\). This probability represents \(\alpha(0)\), which excludes the probability of crossing the upper bound at the final analysis after the lower bound was crossed at the interim analysis. The value of the non-binding Type I error is still \(\alpha^+(0) = 0.10\).

# Updated alpha is unchanged
alphau <- 0.1
# Updated sample size at each analysis
n.I <- c(59, 134)
# Updated number of analyses
ku <- length(n.I)
# Information fraction is used for spending
usTime <- n.I / x$n.I[x$k]
lsTime <- usTime

# Update design based on actual interim sample size and planned final sample size
xu <- gsDesign(
  k = ku,
  test.type = test.type,
  alpha = alphau,
  beta = x$beta,
  sfu = sfu,
  sfupar = sfupar,
  sfl = sfl,
  sflpar = sflpar,
  n.I = n.I,
  maxn.IPlan = x$n.I[x$k],
  delta = x$delta,
  delta1 = x$delta1,
  delta0 = x$delta0,
  endpoint = endpoint,
  n.fix = n,
  usTime = usTime,
  lsTime = lsTime
)

# Summarize bounds
gsBoundSummary(xu, Nname = "N", digits = 4, ddigits = 2, tdigits = 1)

#>   Analysis               Value Efficacy Futility
#>  IA 1: 44%                   Z   2.2209  -0.2304
#>      N: 59         p (1-sided)   0.0132   0.5911
#>                ~delta at bound   4.3370  -0.4500
#>            P(Cross) if delta=0   0.0132   0.4089
#>            P(Cross) if delta=3   0.2468   0.0386
#>      Final                   Z   1.3047   1.3047
#>     N: 134         p (1-sided)   0.0960   0.0960
#>                ~delta at bound   1.6907   1.6907
#>            P(Cross) if delta=0   0.0965   0.9035
#>            P(Cross) if delta=3   0.8350   0.1650
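The correspondence between the rows of this table can be checked with a few lines of base R. The sketch below copies the Z-values and sample sizes from the table and assumes that N is the total sample size with 1:1 randomization, so that the standard error of the treatment difference is approximately \(\text{sd}\sqrt{4/N}\); that assumption and all object names are illustrative.

```r
z_eff <- c(2.2209, 1.3047)  # Efficacy bounds (Z) from the table
z_fut <- c(-0.2304, 1.3047) # Futility bounds (Z) from the table
n <- c(59, 134)             # Sample size at each analysis
sd <- 7.5                   # Standard deviation for change in HAM-D score

# Nominal one-sided p-values corresponding to the Z-value bounds
p_eff <- pnorm(z_eff, lower.tail = FALSE)
p_fut <- pnorm(z_fut, lower.tail = FALSE)

# Approximate treatment difference at each bound: Z times the standard error
se <- sd * sqrt(4 / n)
delta_eff <- z_eff * se
delta_fut <- z_fut * se

# Interim decision: the observed nominal one-sided p-value was 0.77
p_obs <- 0.77
z_obs <- qnorm(p_obs, lower.tail = FALSE)
stop_for_futility <- z_obs <= z_fut[1]

print(round(rbind(p_eff, p_fut), 4))
print(round(rbind(delta_eff, delta_fut), 2))
print(stop_for_futility)
```

Comparing the observed Z-value with the futility bound gives the same decision as comparing the observed p-value of 0.77 with the futility p-value bound of 0.59.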