The TDAvec
package provides implementations of several
vector summaries of persistence diagrams studied in Topological Data
Analysis (TDA). Each is obtained by discretizing the associated summary
function computed from a persistence diagram. The summary functions
included in this package are
For improved computational efficiency, all code behind the vector
summaries is written in C++
using the Rcpp
package. Whenever applicable, when compare our code with existing
implementations in terms of accuracy and run-time cost. In this
vignette, we illustrate the basic usage of the TDAvec
package using simple examples.
Let’s first load the required libraries.
library(TDAvec)
library(TDA) # to compute persistence diagrams
library(microbenchmark) # to compare computational costs
Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise:
<- 100 # point cloud size
N set.seed(123)
<- circleUnif(N) + rnorm(2*N,mean = 0,sd = 0.2)
X # plot the point cloud
plot(X,pch = 20,asp = 1)
Next, we use the TDA
package to compute the persistence
diagram (PD) using the Vietoris-Rips filtration built on top of the
point cloud \(X\).
<- ripsDiag(X,maxdimension = 1,maxscale = 2)$diagram
D sum(D[,1]==0) # number of connected components
#> [1] 100
sum(D[,1]==1) # number of loops
#> [1] 13
sum(D[,1]==2) # number of voids
#> [1] 0
In the ripsDiag()
function, maxdimension
is
the maximum homological dimension of the topological features to be
computed (connected components if maxdimension=0; connected components
and loops if 1; connected components, loops and voids if 2, etc.)
maxscale
is the maximum value of the scale parameter of the
filtration (which we set equal to 2 since the points are sampled from a
circle with diameter 2).
The persistence diagram \(D\) has
100 connected components (the same as the point cloud size), 13 loops
(one with long life-span, the rest are short-lived) and 0 voids along
with their birth
and death
values. To plot the
diagram, we can use the plot()
function.
plot(D)
In the plot, the solid dots and red triangles represent connected components and loops respectively.
Let’s compute a vector summary of the persistence landscape (PL) for homological dimension \(H_0\).
# sequence of scale values to vectorize the summary function
= seq(0,2,length.out=11)
scaleSeq # compute the PL vector summary for homological dimension H_0
computePL(D,homDim = 0,scaleSeq,k=1)
#> [1] 0.0 0.2 0.4 0.6 0.8 1.0 0.8 0.6 0.4 0.2 0.0
Here, the vectorization is performed by evaluating the PL function at
each element of scaleSeq
(i.e. \(0.0,0.2,0.4,\ldots,2.0\)) and arranging the
values into a vector. The parameter k
in
computePL()
is the order of the persistence landscape
function (by default \(k=1\)).
To compute an \(H_1\) vector
summary, set the homDim
argument equal to 1:
# compute the PL vectory summary for homological dimension H_1
computePL(D,homDim = 1,scaleSeq,k=1)
#> [1] 0.000000000 0.014217630 0.000931983 0.191114536 0.391114536 0.269212150
#> [7] 0.069212150 0.000000000 0.000000000 0.000000000 0.000000000
The TDA
package also provides an implementation of the
persistence landscapes. Below we compare the two implementations in
terms of accuracy of results and run-time cost.
<- computePL(D,homDim = 0,k=1,scaleSeq)
pl1 <- as.vector(landscape(D,dimension = 0,KK = 1, tseq = scaleSeq))
pl2 all.equal(pl1,pl2) # -> TRUE (the results are the same)
#> [1] TRUE
<- microbenchmark(
compCost computePL(D,homDim = 0,k=1,scaleSeq),
landscape(D,dimension = 0,KK = 1, tseq = scaleSeq),
times = 500
)<- summary(compCost)
sm <- sm$mean[2]/sm$mean[1] # ratio of computational time means costRatioPL
For homological dimension \(H_0\),
it took TDA::landscape()
about 228 times more time than
TDAvec::computePL()
.
To discretize all the other univariate summary functions (i.e., persistence silhouette, persistent entropy summary function, Euler characteristic curve, normalized life curve and Betti curve), we employ a different vectorization scheme. Instead of evaluating a summary function at increasing scales and arranging the values into a vector, we compute the average values of the summary function between two consecutive scale points using integration. More specifically, if \(f\) is a (univariate) summary function and \(t_1,t_2,\ldots,t_n\) are increasing scale values, we discretize \(f\) as:
\[\begin{equation} \Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1}, \end{equation}\]
where \(\Delta t_k=t_{k+1}-t_k\), \(k=1,\ldots,n-1\). For the above five summary functions, the computation of integrals can be done analytically and efficiently implemented. Note that in this case, vector summaries lie in \(\mathbb{R}^{n-1}\), where the dimension is one less than the number of scale values.
The TDA::silhouette()
function vectorizes the
persistence silhouette function the same way as the
TDA::landscape()
function does. However,
TDA::silhouette()
and TDAvec::computePS()
provide similar results if a dense grid of scale values is used for
vectorization:
<- 101
n = seq(0,2,length.out=n)
scaleSeq <- computePS(D,homDim = 0, p = 1,scaleSeq)
ps1 <- as.vector(silhouette(D,dimension = 0,p = 1,tseq = scaleSeq))
ps2
# plot two vector summaries
plot(scaleSeq[1:(n-1)]+1/(n-1),ps1,
type="l",col="red",xlab="x",ylab="y",lty=1)
lines(scaleSeq,ps2,type='l',col='blue',lty=2)
legend(1.48, 0.122, legend=c("TDAvec","TDA"),
col=c("red","blue"),lty=1:2,cex=0.7)
<- microbenchmark(
compCost computePS(D,homDim = 0,p = 1,scaleSeq),
silhouette(D,dimension = 0,p = 1,tseq = scaleSeq),
times = 500
)<- summary(compCost)
sm <- sm$mean[2]/sm$mean[1]
costRatioPS print(costRatioPS)
#> [1] 3.554547
The \(p\) in
TDAvec::computePS()
and TDA::silhouette()
is
the power of the weights for the silhouette function (by default \(p=1\)). For the above example, the former
was about 4 times faster to than the latter for homological dimension
\(H_0\).
The syntax and usage of the remaining univariate summary functions are very similar.
# sequence of scale values to vectorize the summary function
= seq(0,2,length.out=11)
scaleSeq
# Persistent Entropy Summary (PES) function
# compute PES for homological dimension H0
computePES(D,homDim = 0,scaleSeq)
#> [1] 4.8268301 1.0173158 0.3720045 0.3720045 0.3720045 0.3720045 0.3720045
#> [8] 0.3720045 0.3720045 0.3720045
# compute PES for homological dimension H1
computePES(D,homDim = 1,scaleSeq)
#> [1] 0.05677674 0.26458225 0.33845861 0.34789260 0.34789260 0.34789260
#> [7] 0.12039198 0.00000000 0.00000000 0.00000000
# Euler Characteristic Curve (ECC)
computeECC(D,maxhomDim = 1,scaleSeq) # maxhomDim = maximal homological dimension considered
#> [1] 65.64393340 5.93893485 -0.02801848 0.00000000 0.00000000 0.00000000
#> [7] 0.65393925 1.00000000 1.00000000 1.00000000
# Vector of Averaged Bettis (VAB) - a vectorization of Betti Curve
# compute VAB for homological dimension H0
computeVAB(D,homDim = 0,scaleSeq)
#> [1] 65.928137 7.313179 1.000000 1.000000 1.000000 1.000000 1.000000
#> [8] 1.000000 1.000000 1.000000
# compute VAB for homological dimension H1
computeVAB(D,homDim = 1,scaleSeq)
#> [1] 0.2842034 1.3742440 1.0280185 1.0000000 1.0000000 1.0000000 0.3460608
#> [8] 0.0000000 0.0000000 0.0000000
# Normalized Life (NL) Curve
# compute NL for homological dimension H0
computeNL(D,homDim = 0,scaleSeq)
#> [1] 0.8171309 0.2341008 0.1230901 0.1230901 0.1230901 0.1230901 0.1230901
#> [8] 0.1230901 0.1230901 0.1230901
# compute NL for homological dimension H1
computeNL(D,homDim = 1,scaleSeq)
#> [1] 0.01309940 0.06318557 0.68241758 0.71307326 0.71307326 0.71307326
#> [7] 0.24676667 0.00000000 0.00000000 0.00000000
To discretize the persistence surface and persistence block, we first need to switch from the birth-death to the birth-persistence coordinates.
3] <- D[,3] - D[,2]
D[,colnames(D)[3] <- "Persistence"
The resulting vector summaries are called the persistence image (PI) and the vectorized of persistence block (VPB) respectively.
# Persistence Image (PI)
<- 5 # resolution (or grid size) along the birth axis
resB <- 5 # resolution (or grid size) along the persistence axis
resP # find min and max persistence values
<- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3])
minPH0 # construct one-dimensional grid of scale values
<- seq(minPH0,maxPH0,length.out=resP+1)
ySeqH0 # default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram
<- 0.5*(maxPH0-minPH0)/resP
sigma # compute PI for homological dimension H_0
computePI(D,homDim=0,xSeq=NA,ySeqH0,sigma)
#> [1] 4.56670703 0.96562735 0.01135007 0.02272340 0.47724987
# Vectorized Persistence Block (VPB)
<- 0.3 # parameter in [0,1] which controls the size of blocks around each point of the diagram
tau # compute VPB for homological dimension H_0
computeVPB(D,homDim = 0,xSeq=NA,ySeqH0,tau)
#> [1] 3.90196609 0.06216766 0.00000000 0.77425106 1.80204108
Since the \(H_0\) features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization.
For homological dimension \(H_1\), the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid. For the VPB summary, since the blocks around each point of the persistence diagram have different sizes, we construct the grid with scale values spread out non-uniformly (i.e. the rectangular grid cells have different dimensions). In experiments, this construction of the grid tends to provide better performance over the grid with equal cell sizes.
# PI
# find min & max birth and persistence values
<- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2])
minBH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3])
minPH1 <- seq(minBH1,maxBH1,length.out=resB+1)
xSeqH1 <- seq(minPH1,maxPH1,length.out=resP+1)
ySeqH1 <- 0.5*(maxPH1-minPH1)/resP
sigma # compute PI for homological dimension H_1
computePI(D,homDim=1,xSeqH1,ySeqH1,sigma)
#> [1] 4.371601e-02 4.955567e-02 4.512378e-02 3.299957e-02 1.864442e-02
#> [6] 7.527654e-03 7.760847e-03 6.305986e-03 4.250972e-03 2.278547e-03
#> [11] 6.038006e-05 5.841014e-05 4.427399e-05 3.094726e-05 2.021064e-05
#> [16] 1.834675e-04 8.921009e-04 2.714215e-03 5.171823e-03 6.175378e-03
#> [21] 3.853840e-03 1.874023e-02 5.701774e-02 1.086451e-01 1.297270e-01
# VPB
<- unique(quantile(D[D[,1]==1,2],probs = seq(0,1,by=0.2)))
xSeqH1 <- unique(quantile(D[D[,1]==1,3],probs = seq(0,1,by=0.2)))
ySeqH1 <- 0.3
tau # compute VPB for homological dimension H_1
computeVPB(D,homDim = 1,xSeqH1,ySeqH1,tau)
#> [1] 0.0000000000 0.0004995598 0.0090337733 0.0305098818 0.0057973687
#> [6] 0.0095194496 0.0261053157 0.0203673036 0.0042870130 0.0107446149
#> [11] 0.0425768331 0.1053468321 0.1458807241 0.0000000000 0.0000000000
#> [16] 0.0328035481 0.0276785592 0.1089088594 0.1318822254 0.0168304895
#> [21] 0.2939394561 0.3162986126 0.3344510989 0.3567486932 0.3636226941
As a note, the code for computePI()
is adopted from the
pers.image()
function (available in the
kernelTDA
package) with minor modifications. For example,
pers.image()
uses a one-dimensional grid for homological
dimension \(H_0\) regardless of the
filtration type. In contrast, computePI()
uses a
one-dimensional grid only if additionally the birth values are the same
(which may not be true for some filtrations such as the lower-star
filtration). Moreover, pers.image()
uses a square grid
(e.g., 10 by 10) for vectorization, whereas computePI()
is
not limited to such a grid and can compute vector summaries using a
rectangular grid (e.g., 10 by 20).
Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1), 77-102.
Chazal, F., Fasy, B. T., Lecci, F., Rinaldo, A., & Wasserman, L. (2014, June). Stochastic convergence of persistence landscapes and silhouettes. In Proceedings of the thirtieth annual symposium on Computational geometry (pp. 474-483).
Atienza, N., Gonzalez-Díaz, R., & Soriano-Trigueros, M. (2020). On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recognition, 107, 107509.
Richardson, E., & Werman, M. (2014). Efficient classification using the Euler characteristic. Pattern Recognition Letters, 49, 99-106.
Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Frontiers in Artificial Intelligence, 108.
Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. Advances in Computational Mathematics, 48(1), 1-42.
Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., … & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18.
Chan, K. C., Islambekov, U., Luchinsky, A., & Sanders, R. (2022). A computationally efficient framework for vector representation of persistence diagrams. Journal of Machine Learning Research, 23, 1-33.