This page provides installation instruction and usage examples for the R package ADPclust. Please see the paper for details of the procedure:
Most recent developments are in this GitHub repo.
ADPclust is a non-iterative procedure that finds the number of clusters and cluster assignments of large amount of high dimensional data by identifying cluster centroids from estimated local densities. The procedure is built upon the work by Rodriguez [2014]. ADPclust automatically identifies cluster centroids from a projected two dimensional decision plot that separates cluster centroids from the rest of the points. This decision plot is generated from the local density \(f(\mathbf{x})\) and an “isolation” score \(\delta(\mathbf{x})\) for each data point \(\mathbf{x}\).
For a data set \(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) where each \(\mathbf{x}_i\) is a d dimensional vector, ADPclust first estimates the local multivariate Gaussian density \(f(\mathbf{x}_i), i=1,\ldots,n\) by
\[\hat{f}(\mathbf{x}_i; h_1,...h_d) = n^{-1} \left(\prod_{l=1}^d h_l \right)^{-1} \cdot \sum_{j=1}^n K\left(\frac{x_{i1} - x_{j1}}{h_1}, ..., \frac{x_{id} - x_{jd}}{h_d}\right). \] where \(h_1,...,h_d\) are bandwidths at each dimension. Two default \(h\) values are provided in ADPclust:
Other bandwidths can also be specified if the default values don’t give satisfactory results.
Given density estimation \(\hat{f}(\mathbf{x}_i), i = 1,...,n\), the “isolation” indices \(\delta(\mathbf{x}_i)'s\) are found by:
\[\hat{\delta}(\mathbf{x}_i) = \min_{j:\hat{f}(\mathbf{x}_i) < \hat{f}(\mathbf{x}_j)}{d(\mathbf{x}_i,\mathbf{x}_j)}.\]
where \(d(\mathbf{x}_i,\mathbf{x}_j)\) is the distance between \(\mathbf{x}_i\) and \(\mathbf{x}_j\).
The scatter plot of \((\hat{f}(\mathbf{x}_i), \hat{\delta}(\mathbf{x}_i)), i = 1,...,n\) is called a decision plot, from which \(k\) centroids are selected automatically or manually from the upper-right corner, and all other points are clustered according to their distances to the closest centroid.
The average silhouette score is calculated after clusters are assigned, and is used to chose the best number of clusters among a sequence of testing \(k\)’s.
Run the following line to install the package.
install.packages("ADPclust", repos = "http://cran.us.r-project.org")
Run the following line to load the package.
library(ADPclust)
The automatic centroids selection by ADPclust finds the best bandwidth \(h\) and number of clusters \(k\) from a grid of \((h,k)\) pairs. By default, the testing \(h's\) are 10 values evenly spread in the interval \([1/3h_0, 3h_0]\), where \(h_0\) is the Wand’s asymptotic mean integrated squared error bandwidth (AMISE). The default testing cluster numbers are \(k = 2,\ldots,10\). Here is a simple example:
# Load a simple simulated data set with 3 clusters.
data(clust3)
ans <- adpclust(clust3)
# Above is equivalent to
# ans <- adpclust(clust3, centroids = "auto")
plot(ans)
The output of ADPclust ans
is an object of class adpclust
associated with a summary
and a plot
method. plot(ans)
produces a figure similar to the one shown above. summary(ans)
gives a fitting summary:
summary(ans)
## -- ADPclust Result --
##
## Number of obs.: 90
## Centroids selection: auto
## Number of clusters: 3
## Avg. Silhouette: 0.7747114
## Elements in result: $clusters $centers $silhouette $nclust $h $f $delta $selection.type $tested
## f(x):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.444 5.048 7.907 7.599 10.300 13.210
##
## delta(x):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02322 0.05521 0.09144 0.15980 0.13480 2.06300
The input can be a distance matrix of class dist
instead of the raw data frame. Note that if centroid = "auto"
(default), then the dimension p
must be provided to calculate h
.
# A simple wrapper of dist() with normalization
distm <- FindDistm(clust3, normalize = TRUE)
ans.distm <- adpclust(distm = distm, p = 2)
The reference bandwidth \(h_0\) can be changed to Scott’s rule-of-thumb (ROT) value by setting htype = "ROT"
:
# Result is similar. Not shown.
ans <- adpclust(clust3, htype = "ROT")
Passing a specific value to the optional argument h
specifies the bandwidth and suppresses htype
. If a numeric vector is passed to h
then every entry of it is tested to find the one given the best clustering result, according to average silhouette. Note h
is a relative value so manually setting its value often requires trial and error.
# Setting a single h. Result not shown.
ans <- adpclust(clust3, h = 10)
# Setting a vector of testing h's. Result not shown.
ans <- adpclust(clust3, h = c(10, 12, 18))
# Setting h to the 'ROT' bandwidth. result not shown.
ans <- adpclust(clust3, h = ROT(clust3))
The number of (testing) cluster(s) can be set by the nclust
argument.
# Setting different testing cluster numbers
ans <- adpclust(clust3, nclust = 2:15)
# Specifying one cluster number.
ans <- adpclust(clust3, nclust = 3)
plot(ans, to.plot = "fd")
Another important argument is f.cut
, denoting the cutoff value of \(f's\) (red dotted line in the middle figure) for centroid/outlier discrimination. Points to the right of the line with high \(delta's\) are potential cluster centroids. Points to the left of it with high \(delta's\) are potential outliers. f.cut
is the percentile value of \(f\) with default value at 10%.
# Load a data set with 10 clusters
data(clust10)
ans <- adpclust(clust10, f.cut = 0.1, nclust = 5:13, h = ROT(clust10))
plot(ans)
Setting f.cut
to different values could result in different cluster assignment. In the following case f.cut
is obviously set too high:
ans <- adpclust(clust10, f.cut = 0.95, nclust = 5:13, h = ROT(clust10))
plot(ans)
ADPclust also allow user to interactively select cluster centroids from the \((f(x), \delta(x))\) decision scatter plot. After running the following line, the first figure below is displayed, on which you can click arbitrary number of centroids, then hit “ESC” to end selection. The right figure then shows the corresponding clustering result.
data(clust5.1)
ans <- adpclust(clust5.1, centroids = "user")