Title: | Intrinsic Dimension Estimation |
---|---|
Description: | A variety of methods for estimating intrinsic dimension of data sets (i.e the manifold or Hausdorff dimension of the support of the distribution that generated the data) as reviewed in Johnsson, K. (2016, ISBN:978-91-7623-921-6) and Johnsson, K., Soneson, C. and Fontes, M. (2015) <doi:10.1109/TPAMI.2014.2343220>. Furthermore, to evaluate the performance of these estimators, functions for generating data sets with given intrinsic dimensions are provided. |
Authors: | Kerstin Johnsson, Lund University |
Maintainer: | Kerstin Johnsson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.2.0 |
Built: | 2025-02-22 03:55:07 UTC |
Source: | https://github.com/kjohnsson/intrinsicdimension |
The intrinsic dimension of a data set is a measure of its complexity. In technical terms it typically means the manifold or Hausdorff (fractal) dimension of the support of the probability distribution generating the data. This package contains functions for estimating intrinsic dimension and generating ground truth data sets with known intrinsic dimension.
Data sets that can be accurately described with a few parameters have low intrinsic dimension. It is expected that the performance of many machine learning algorithms is dependent on the intrinsic dimension of the data. Is has also been proposed to use estimates of intrinsic dimension for applications such as network anomaly detection and image analysis.
This package contains implementations of a variety of approaches for intrinsic dimension estimation: modeling distances by for example Maximum Likelihood, approximating hyperplanes using Principal Component Analysis (PCA) and modeling angular information and concentration of measure (ESS and DANCo methods). Ground truth data, i.e. data with known intrinsic dimension, can be generated with a number of functions modeling manifolds. The manifold dimension is the intrinsic dimension.
The package distinguishes between local, global and pointwise estimators of intrinsic dimension. Local estimators estimate dimension of a _local data set_, for example a neighborhood from a larger data set. For this estimate to be accurate the noise and the curvature of the data has to be small relative to the neighborhood diameter. A global estimator takes the entire data set and returns one estimate of intrinsic dimension. Global estimators has the potential to handle higher noise and curvature levels than local estimators, but require that the entire data set has the same intrinsic dimension. Pointwise estimators are essentially local estimators applied neighborhoods around each point in the data set, but sometimes information beyond the neighborhood is used, as in PCA with Optimally Topology Preserving Maps. Any local estimator can be converted into a pointwise estimator.
Functions for estimating intrinsic dimension: localIntrinsicDimension
, globalIntrinsicDimension
, pointwiseIntrinsicDimension
, essLocalDimEst
, dancoDimEst
, pcaLocalDimEst
, pcaOtpmPointwiseDimEst
,
maxLikGlobalDimEst
, maxLikLocalDimEst
, maxLikPointwiseDimEst
, knnDimEst
.
Functions for generating data points from (usually uniform) distributions on
manifolds (possibly with noise): hyperBall
, hyperSphere
, hyperCube
,
isotropicNormal
, hyperCubeFaces
, hyperCubeEdges
, cutHyperPlane
, cutHyperSphere
, oblongNormal
, swissRoll
, swissRoll3Sph
, twinPeaks
, hyperTwinPeaks
,
cornerPlane
, mHeinManifold
, m14Manifold
, m15Manifold
.
Functions for applying local estimators to non-local data: asPointwiseEstimator
, neighborhoods
Kerstin Johnsson, Lund University
Maintainer: Kerstin Johnsson <[email protected]>
Johnsson, K (2016). Structures in high-dimensional data: Intrinsic dimension and cluster analysis. PhD thesis. Lund University. http://portal.research.lu.se/portal/sv/publications/structures-in-highdimensional-data-intrinsic-dimension-and-cluster-analysis(8404f72e-e760-436d-ad7f-1be15af4b3d1).html
Embeds the data in n
dimensions and adds normal isotropic noise to
the data set. Hence n
has to be at least equal to the dimension (the
number of columns) of the data set, otherwise the function terminates with
an error.
addNoise(data, n = ncol(data), sd)
addNoise(data, n = ncol(data), sd)
data |
data set. Each row corresponds to a data point. |
n |
dimension of noise. |
sd |
standard deviation of noise. The covariance matrix of the noise
is |
Matrix of same size as data
.
Kerstin Johnsson, Lund University
datap <- hyperCubeEdges(100, 1, 2) datap <- addNoise(datap, 3, .05) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
datap <- hyperCubeEdges(100, 1, 2) datap <- addNoise(datap, 3, .05) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
Returns a function that can be used as a pointwise estimator of intrinsic dimension that uses local data sets with a fixed number of data points.
asPointwiseEstimator(estimator, neighborhood.size, indices=NULL, eps=0.0)
asPointwiseEstimator(estimator, neighborhood.size, indices=NULL, eps=0.0)
estimator |
A local intrinsic dimension estimator. |
neighborhood.size |
The number of neighbors used for each dimension estimate. |
indices |
A vector with indices of the points in |
eps |
If non-zero, the relative error in distance allowed when finding nearest neighbors. See Details. |
The ann
function of the package yaImpute
is used for finding the k
nearest neighbors. The eps
parameter to neighborhoods
is used in the ann
function.
A function that can be used as a pointwise dimension estimator.
Kerstin Johnsson, Lund University
data <- swissRoll3Sph(300, 300) # the first 300 data points are on the swiss roll (ID=2) , the last 300 on the 3-sphere (ID=3) essPointwiseDimEst <- asPointwiseEstimator(essLocalDimEst, neighborhood.size=10, indices = c(1:10, 301:310)) ess.pw.res <- essPointwiseDimEst(data) ess.pw.res$dim.est
data <- swissRoll3Sph(300, 300) # the first 300 data points are on the swiss roll (ID=2) , the last 300 on the 3-sphere (ID=3) essPointwiseDimEst <- asPointwiseEstimator(essLocalDimEst, neighborhood.size=10, indices = c(1:10, 301:310)) ess.pw.res <- essPointwiseDimEst(data) ess.pw.res$dim.est
Generates a sample from a uniform distribution on a bent plane. Half of the plane is in the xz-plane and half of the plane is bent over the x-axis, so that the resulting surface has an edge along the x-axis.
cornerPlane(Ns, theta = pi/4)
cornerPlane(Ns, theta = pi/4)
Ns |
number of data points. |
theta |
angle at the x-axis. |
A Ns
x 3 matrix with columns x, y and z.
Kerstin Johnsson, Lund University
datap <- cornerPlane(400) par(mfrow = c(1, 2)) plot(datap[,1], datap[,2]) plot(datap[,1], datap[,3])
datap <- cornerPlane(400) par(mfrow = c(1, 2)) plot(datap[,1], datap[,2]) plot(datap[,1], datap[,3])
Generates Ns
data points within the unit ball from a hyperplane
through the origin with noise added. n
has to be at least d
,
otherwise the function terminates with an error.
cutHyperPlane(Ns, d, n, sd)
cutHyperPlane(Ns, d, n, sd)
Ns |
number of data points. |
d |
dimension of hyperplane. |
n |
dimension of noise. |
sd |
standard deviation of noise. |
The data set is generated the following way: First data points are sampled
uniformly in a d
-ball. After this, (n-d)
-dimensional orthogonal noise with
standard deviation sd
in each direction is added. No noise is added in the
directions parallel to the hyperplane since on an infinite plane adding
isotropic noise to a uniform distribution does not change the
distribution. Finally all data points within distance 1 from the origin are
considered as candidates for the data set that will be returned, out of the
candidates Ns
data points are chosen randomly to be returned.
If there are less than Ns
candidates more candidates will be generated
in the same way.
The data generated by this function can be used to evaluate how much local dimension estimators are affected by noise.
A Ns
x n
matrix.
If sd
is high, cutHyperPlane
will be slow and might not even
be able to return a data set. If so, it will return NULL
.
Kerstin Johnsson, Lund University
datap <- cutHyperPlane(100, 2, 3, 0.01) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
datap <- cutHyperPlane(100, 2, 3, 0.01) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
Generates Ns
data points cut out from a noisy hypersphere. n
has to be at least d+1
, otherwise the function terminates with an
error.
cutHyperSphere(Ns, rat, d, n, sd)
cutHyperSphere(Ns, rat, d, n, sd)
Ns |
number of data points. |
rat |
ratio between cut-off radius and radius of sphere. |
d |
(intrinsic) dimension of hypersphere. |
n |
dimension of noise. |
sd |
standard deviation of noise. |
The returned data are within distance rat
the point
and are obtained from a unit distribution on the
d
-sphere overlaid with n
-dimensional normal noise.
The data generated by this function can be used to evaluate the performance of local dimension estimators.
A Ns
by n
matrix.
If sd
is high, cutHyperSphere
will be slow and might not even
be able to return a data set. If so, it will return NULL
.
Kerstin Johnsson, Lund University
datap <- cutHyperSphere(100, rat = .5, 1, 3, 0.01) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3]) datap <- cutHyperSphere(100, rat = 2, 1, 3, 0.11) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
datap <- cutHyperSphere(100, rat = .5, 1, 3, 0.01) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3]) datap <- cutHyperSphere(100, rat = 2, 1, 3, 0.11) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
Intrinsic dimension estimation with the DANCo (Ceruti et al. 2012), MIND_MLi and MIND_MLk (Rozza et al. 2012) methods.
dancoDimEst(data, k, D, ver = "DANCo", calibration.data = NULL)
dancoDimEst(data, k, D, ver = "DANCo", calibration.data = NULL)
data |
a data set for which the intrinsic dimension is estimated. |
k |
neighborhood parameter. |
D |
maximal dimension. |
ver |
possible values: 'DANCo', 'MIND_MLi', 'MIND_MLk'. |
calibration.data |
precomputed calibration data. |
If cal = NULL
or the cal$maxdim < D
new calibration data will be computed as needed.
A DimEst
object with slots:
dim.est |
the intrinsic dimension estimate. |
kl.divergence |
the KL divergence between data and reference data for the estimated dimension (if ver == 'DANCo'). |
calibration.data |
calibration data that can be reused when applying DANCo to data sets of the same size with the same neighborhood parameter k. |
Kerstin Johnsson, Lund University
Ceruti, C. et al. (2012) DANCo: Dimensionality from Angle and Norm Concentration. arXiv preprint 1206.3881.
Rozza, A et al. (2012) Novel high intrinsic dimensionality estimators. Machine learning 89, 37-65.
data <- hyperBall(50, 10) res <- dancoDimEst(data, 8, 20) print(res) ## Reusing calibration data data2 <- hyperBall(50, 5) dancoDimEst(data2, 8, 20, calibration.data=res$calibration.data)
data <- hyperBall(50, 10) res <- dancoDimEst(data, 8, 20) print(res) ## Reusing calibration data data2 <- hyperBall(50, 5) dancoDimEst(data2, 8, 20, calibration.data=res$calibration.data)
Local intrinsic dimension estimation with the ESS method
essLocalDimEst(data, ver, d = 1)
essLocalDimEst(data, ver, d = 1)
data |
Local data set for which dimension should be estimated. |
ver |
Possible values: |
d |
For |
The ESS method assumes that the data is local, i.e. that it is a neighborhood taken from a larger data set, such that the curvature and the noise within the neighborhood is relatively small. In the ideal case (no noise, no curvature) this is equivalent to the data being uniformly distributed over a hyper ball.
A DimEst
object with two slots:
dim.est |
The interpolated dimension estimate. |
ess |
The ESS value produced by the algorithm. |
Kerstin Johnsson, Lund University
Johnsson, K., Soneson, C., & Fontes, M. (2015). Low Bias Local Intrinsic Dimension Estimation from Expected Simplex Skewness. IEEE Trans. Pattern Anal. Mach. Intell., 37(1), 196-202.
data <- hyperBall(100, 4, 15, .05) essLocalDimEst(data, ver = 'a', d = 1)
data <- hyperBall(100, 4, 15, .05) essLocalDimEst(data, ver = 'a', d = 1)
Reference values for the ESS dimension estimation method
essReference(ver, d, maxdim, mindim=1)
essReference(ver, d, maxdim, mindim=1)
ver |
Possible values: |
d |
For |
maxdim |
The largest dimension for which reference values should be computed. |
mindim |
The smallest dimension for which reference values should be computed. |
The ESS reference values are used by the ESS algorithm (essLocalDimEst) to compute the final dimension estimate.
A vector of length maxdim-(mindim-1), where each slot represents the reference value.
Kerstin Johnsson, Lund University
Johnsson, K., Soneson, C., & Fontes, M. (2015). Low Bias Local Intrinsic Dimension Estimation from Expected Simplex Skewness. IEEE Trans. Pattern Anal. Mach. Intell., 37(1), 196-202.
essReference('a', 3, maxdim=500) essReference('b', 1, maxdim=30, mindim=3)
essReference('a', 3, maxdim=500) essReference('b', 1, maxdim=30, mindim=3)
Generates a sample from a uniform distribution on a hypercube, the faces of a hypercube or the “edges” of a hyper cube.
hyperCube(Ns, n, side = 1) hyperCubeFaces(Ns, n) hyperCubeEdges(Ns, d, n)
hyperCube(Ns, n, side = 1) hyperCubeFaces(Ns, n) hyperCubeEdges(Ns, d, n)
Ns |
number of data points. |
d |
dimension of edges. |
n |
dimension of the hypercube. |
side |
the length of the side of the hyper cube. |
The hypercube is . The edges of dimension
d
of the
hypercube are the d
-dimensional boundaries of the hypercube. The
hypercube faces are the hyper cube edges of dimension n-1
.
A Ns
by n
matrix.
Kerstin Johnsson, Lund University.
datap <- hyperCubeEdges(200, 1, 3) par(mfrow = c(1, 3)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3]) plot(datap[, 2], datap[, 3])
datap <- hyperCubeEdges(200, 1, 3) par(mfrow = c(1, 3)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3]) plot(datap[, 2], datap[, 3])
Intrinsic dimension estimation with method given as parameter.
localIntrinsicDimension(.data, .method, ...) globalIntrinsicDimension(.data, .method, ...) pointwiseIntrinsicDimension(.data, .method, ...)
localIntrinsicDimension(.data, .method, ...) globalIntrinsicDimension(.data, .method, ...) pointwiseIntrinsicDimension(.data, .method, ...)
.data |
Data set for which dimension should be estimated. |
.method |
For |
... |
arguments passed to intrinsic dimension estimator. |
For the localIntrinsicDimension
function, .data
should be a
local data set, i.e. a piece of a data set that is well approximated by a
hyperplane (meaning that the curvature should be low in the local data set).
The function pointwiseIntrinsicDimension
estimates local dimension
around each data point in the data set.
For localIntrinsicDimension
and globalIntrinsicDimension
, a DimEst
object with the slot dim.est
containing the dimension estimate and possibly additional slots containing additional information about the estimation process.
For pointwiseIntrinsicDimension
, a DimEstPointwise
object, inheriting data.frame
, with one slot dim.est
containing the dimension estimates and possibly additional slots containing additional information about the estimation process.
Kerstin Johnsson, Lund University
Johnsson, K (2016). Structures in high-dimensional data: Intrinsic dimension and cluster analysis. PhD thesis. Lund University.
Johnsson, K., Soneson, C. and Fontes, M. (2015). Low Bias Local Intrinsic Dimension Estimation from Expected Simplex Skewness. IEEE Trans. Pattern Anal. Mach. Intell., 37(1), 196-202.
Ceruti, C. et al. (2012). DANCo: Dimensionality from Angle and Norm Concentration. arXiv preprint 1206.3881.
Rozza, A et al. (2012). Novel high intrinsic dimensionality estimators. Machine learning 89, 37-65.
Fukunaga, K. and Olsen, D. R. (1971). An algorithm for finding intrinsic dimensionality of data. IEEE Trans. Comput., c-20(2):176-183.
Fan, M. et al. (2010). Intrinsic dimension estimation of data by principal component analysis. arXiv preprint 1002.2050.
Bruske, J. and Sommer, G. (1998) Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans. on Pattern Anal. and Mach. Intell., 20(5), 572-575.
Haro, G., Randall, G. and Sapiro, G. (2008) Translated Poisson Mixture Model for Stratification Learning. Int. J. Comput. Vis., 80, 358-374.
Hill, B. M. (1975) A simple general approach to inference about the tail of a distribution. Ann. Stat., 3(5) 1163-1174.
Levina, E. and Bickel., P. J. (2005) Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems 17, 777-784. MIT Press.
Carter, K.M., Raich, R. and Hero, A.O. (2010) On local intrinsic dimension estimation and its applications. IEEE Trans. on Sig. Proc., 58(2), 650-663.
essLocalDimEst
, dancoDimEst
, pcaLocalDimEst
, knnDimEst
pcaOtpmPointwiseDimEst
, maxLikGlobalDimEst
, maxLikLocalDimEst
,
maxLikPointwiseDimEst
data <- hyperBall(100, 4, 15, .05) localIntrinsicDimension(data, .method='essLocalDimEst', ver = 'a', d = 1) globalIntrinsicDimension(data, 'dancoDimEst', k = 8, D = 20) pointwiseIntrinsicDimension(data, .method='maxLikPointwiseDimEst', k = 8, dnoise = NULL)
data <- hyperBall(100, 4, 15, .05) localIntrinsicDimension(data, .method='essLocalDimEst', ver = 'a', d = 1) globalIntrinsicDimension(data, 'dancoDimEst', k = 8, D = 20) pointwiseIntrinsicDimension(data, .method='maxLikPointwiseDimEst', k = 8, dnoise = NULL)
Estimates the intrinsic dimension of a data set using weighted average kNN distances.
knnDimEst(data, k, ps, M, gamma = 2)
knnDimEst(data, k, ps, M, gamma = 2)
data |
data set with each row describing a data point. |
k |
number of distances to neighbors used at a time. |
ps |
vector with sample sizes; each sample size has to be larger than
k and smaller than |
M |
number of bootstrap samples for each sample size. |
gamma |
weighting constant. |
This is a somewhat simplified version of the kNN dimension estimation method described by Carter et al. (2010), the difference being that block bootstrapping is not used.
A DimEst
object with slots:
dim.est |
the intrinsic dimension estimate (integer). |
residual |
the residual, see Carter et al. (2010). |
Kerstin Johnsson, Lund University.
Carter, K.M., Raich, R. and Hero, A.O. (2010) On local intrinsic dimension estimation and its applications. IEEE Trans. on Sig. Proc., 58(2), 650-663.
N <- 50 data <- hyperBall(N, 5) k <- 2 ps <- seq(max(k + 1, round(N/2)), N - 1, by = 3) knnDimEst(data, k, ps, M = 10, gamma = 2)
N <- 50 data <- hyperBall(N, 5) k <- 2 ps <- seq(max(k + 1, round(N/2)), N - 1, by = 3) knnDimEst(data, k, ps, M = 10, gamma = 2)
Generates data sets from Rozza et al. (2012). M14 is an 18-dimensional manifold with intrinsic dimension 72. M14 is a 24-dimensional manifold with extrinsic dimension 96. Note that M14 and M15 are not uniformly sampled.
m14Manifold(Ns) m15Manifold(Ns)
m14Manifold(Ns) m15Manifold(Ns)
Ns |
number of data points. |
A 72
-dimensional or 96
-dimensional data set respectively.
Kerstin Johnsson, Lund University.
Rozza, A. et al. (2012) Novel high intrinsic dimensionality estimators. Machine Learning, 89:37-65.
datap <- m14Manifold(800) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2]) datap <- m15Manifold(800) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2])
datap <- m14Manifold(800) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2]) datap <- m15Manifold(800) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2])
Generates a 12-dimensional manifold with extrinsic dimension 72 (not uniformly sampled).
mHeinManifold(Ns)
mHeinManifold(Ns)
Ns |
number of data points. |
A 72
-dimensional data set.
Kerstin Johnsson, Lund University.
Hein, M. and Audibert, J-Y. (2005) Intrinsic Dimensionality Estimation of Submanifolds in R^d. Proceedings of ICML, 289-296.
datap <- mHeinManifold(800) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2])
datap <- mHeinManifold(800) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2])
Get a list of neighborhoods, each containing the k
nearest neighbors (not including itself) to a point in the data set.
neighborhoods(data, k, indices, eps=0.0)
neighborhoods(data, k, indices, eps=0.0)
data |
A data set. |
k |
The number of neighbors in each neighborhood. |
indices |
A vector with indices of the points in |
eps |
If non-zero, the relative error in distance allowed when finding nearest neighbors. See Details. |
The ann
function of the package yaImpute
is used for finding the k
nearest neighbors. The eps
parameter to neighborhoods
is used in the ann
function.
A list of neighborhoods where each item corresponds to one index in indices
and each item contains a data set with k
data points.
Kerstin Johnsson, Lund University
data <- swissRoll3Sph(300, 300) neighborhoods(data, 10, 1:10)
data <- swissRoll3Sph(300, 300) neighborhoods(data, 10, 1:10)
Transition functions describing the shift in lengths of vectors
when Gaussian noise is added. Given a length
,
is the
probability density for the length after noise is added to one endpoint.
dnoiseNcChi(r, s, sigma, k) dnoiseGaussH(r, s, sigma, k)
dnoiseNcChi(r, s, sigma, k) dnoiseGaussH(r, s, sigma, k)
r |
length or vector of lengths of original vector. |
s |
length or vector of lengths of perturbed vector. |
sigma |
noise standard deviation. |
k |
dimension of noise. |
dnoiseNcChi
is the true transition function density when the noise
is Gaussian, the other transition functions are approximations of this.
dnoiseGaussH
is the Gaussian approximation used in Haro et al.
If Gaussian noise is added to both endpoints of the vector, sigma
should be replaced by
sqrt(2)*sigma
.
Vector of probability densities.
Only r
or s
can be a vector.
Kerstin Johnsson, Lund University
Haro, G., Randall, G. and Sapiro, G. (2008) Translated Poisson Mixture Model for Stratification Learning. Int. J. Comput. Vis., 80, 358-374.
maxLikPointwiseDimEst
, maxLikGlobalDimEst
, maxLikLocalDimEst
# High SNR, high-dim noise sigma <- 0.05 x <- seq(0, 1.5, length.out = 200) y <- dnoiseNcChi(x, s = .5, sigma, k = 20) plot(x, y, type = 'l', main = 'Noise dim = 20') y2 <- dnoiseGaussH(x, s = .5, sigma, k = 20) lines(x, y2, lty = 2) # Low SNR par(mfrow = c(2, 3)) sigma <- 0.2 x <- seq(0, 1.5, length.out = 200) y <- dnoiseNcChi(x, s = .5, sigma, k = 4) plot(x, y, type = 'l', main = 'Noise approximations') y2 <- dnoiseGaussH(x, s = .5, sigma, k) lines(x, y2, lty = 2) # High SNR, low-dim noise sigma <- 0.05 x <- seq(0, 1.5, length.out = 200) y <- dnoiseNcChi(x, s = .5, sigma, k = 4) plot(x, y, type = 'l', main = 'Noise dim = 4') y2 <- dnoiseGaussH(x, s = .5, sigma, k) lines(x, y2, lty = 2)
# High SNR, high-dim noise sigma <- 0.05 x <- seq(0, 1.5, length.out = 200) y <- dnoiseNcChi(x, s = .5, sigma, k = 20) plot(x, y, type = 'l', main = 'Noise dim = 20') y2 <- dnoiseGaussH(x, s = .5, sigma, k = 20) lines(x, y2, lty = 2) # Low SNR par(mfrow = c(2, 3)) sigma <- 0.2 x <- seq(0, 1.5, length.out = 200) y <- dnoiseNcChi(x, s = .5, sigma, k = 4) plot(x, y, type = 'l', main = 'Noise approximations') y2 <- dnoiseGaussH(x, s = .5, sigma, k) lines(x, y2, lty = 2) # High SNR, low-dim noise sigma <- 0.05 x <- seq(0, 1.5, length.out = 200) y <- dnoiseNcChi(x, s = .5, sigma, k = 4) plot(x, y, type = 'l', main = 'Noise dim = 4') y2 <- dnoiseGaussH(x, s = .5, sigma, k) lines(x, y2, lty = 2)
Generates a sample from a certain anisotropic normal distribution centered around the origin.
oblongNormal(Ns, n)
oblongNormal(Ns, n)
Ns |
number of data points. |
n |
dimension of the distribution (and the data points). |
In the first half of the dimensions (rounded down if n
is odd)
the standard deviation is 1 and in the rest the standard deviation is 0.25 .
A Ns
by n
matrix.
Kerstin Johnsson, Lund University
datap <- oblongNormal(100, 10) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 6])
datap <- oblongNormal(100, 10) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 6])
Estimates local manifold dimension using the largest singular values of the covariance matrix.
pcaLocalDimEst(data, ver, alphaFO = .05, alphaFan = 10, betaFan = .8, PFan = .95, ngap = 5, maxdim = min(dim(data)), verbose = TRUE)
pcaLocalDimEst(data, ver, alphaFO = .05, alphaFan = 10, betaFan = .8, PFan = .95, ngap = 5, maxdim = min(dim(data)), verbose = TRUE)
data |
a local data set for which dimension should be estimated. |
ver |
possible values: |
alphaFO |
only for |
alphaFan |
only for |
betaFan |
only for |
PFan |
only for |
ngap |
only for |
maxdim |
only for |
verbose |
should information about the process be printed out? |
Version 'FO'
is the method by Fukunaga-Olsen, version 'fan'
is the method by Fan et al..
Version 'maxgap'
returns the position of the largest relative gap in the sequence of singular values.
Version 'cal'
considers the positions of the ngap
largest relative gaps in the
sequence of singular values and generates calibration data to determine which one of them is most likely.
All versions assume that the data is local, i.e. that it is a neighborhood taken from a larger data set, such that the curvature and the noise within the neighborhood is relatively small. In the ideal case (no noise, no curvature) this is equivalent to the data being uniformly distributed over a hyper ball.
A DimEst
object with slots:
dim.est |
the dimension estimate |
gap.size |
if |
likelihood |
if |
Kerstin Johnsson, Lund University
Fukunaga, K. and Olsen, D. R. (1971). An algorithm for finding intrinsic dimensionality of data. IEEE Trans. Comput., c-20(2):176-183.
Fan, M. et al. (2010). Intrinsic dimension estimation of data by principal component analysis. arXiv preprint 1002.2050.
data <- cutHyperPlane(100, 4, 10, .05) pcaLocalDimEst(data, 'fan') pcaLocalDimEst(data, 'FO') pcaLocalDimEst(data, 'maxgap')
data <- cutHyperPlane(100, 4, 10, .05) pcaLocalDimEst(data, 'fan') pcaLocalDimEst(data, 'FO') pcaLocalDimEst(data, 'maxgap')
Intrinsic dimension estimation with the method proposed in Bruske and Sommer (1998). A graph called optimally topology preserving map (OTPM) is constructed and on this local PCA is made with the Fukunaga-Olsen criterion to determine which eigenvalues that are significant.
pcaOtpmPointwiseDimEst(data, N, alpha = .05)
pcaOtpmPointwiseDimEst(data, N, alpha = .05)
data |
a data set for which dimension should be estimated. |
N |
the number of the nodes in the OTPM. |
alpha |
the significance level for the Fukunaga-Olsen method. |
A DimEstPointwise
object, inheriting data.frame
, with two columns:
dim.est |
The dimension estimate at each point. |
nbr.nb |
The number of neighboring nodes used for the dimension estimate at each point. |
Kerstin Johnsson, Lund University
Bruske, J. and Sommer, G. (1998) Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans. on Pattern Anal. and Mach. Intell., 20(5), 572-575.
data <- hyperBall(1000, 5) pcaOtpmPointwiseDimEst(data, 400)
data <- hyperBall(1000, 5) pcaOtpmPointwiseDimEst(data, 400)
Generates a sample from isotropic distributions in d
dimensions with
n
-dimensional noise added to it.
hyperBall(Ns, d, n = d, sd = 0) hyperSphere(Ns, d, n = d + 1, sd = 0) isotropicNormal(Ns, d, n = d, sd = 0)
hyperBall(Ns, d, n = d, sd = 0) hyperSphere(Ns, d, n = d + 1, sd = 0) isotropicNormal(Ns, d, n = d, sd = 0)
Ns |
number of points. |
d |
intrinsic dimension of the support of the distribution (the manifold.) |
n |
dimension of noise. |
sd |
standard deviation of noise. |
hyperBall
draws a sample from a uniform distribution on a hyper ball of
radius 1.
hyperSphere
draws a sample from a uniform distribution on a hypersphere
of radius 1.
isotropicNormal
draws a sample from a isotropic normal distribution with
identity covariance matrix.
Kerstin Johnsson, Lund University
datap <- hyperSphere(100, 1, 3, sd = .1) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
datap <- hyperSphere(100, 1, 3, sd = .1) par(mfrow = c(1, 2)) plot(datap[, 1], datap[, 2]) plot(datap[, 1], datap[, 3])
Generates a sample from a uniform distribution on a Swiss roll-surface, possibly together with a sample from a uniform distribution on a 3-sphere inside the Swiss roll.
swissRoll(Ns, a = 1, b = 2, nturn = 1.5, h = 4) swissRoll3Sph(Ns, Nsph, a = 1, b = 2, nturn = 1.5, h = 4)
swissRoll(Ns, a = 1, b = 2, nturn = 1.5, h = 4) swissRoll3Sph(Ns, Nsph, a = 1, b = 2, nturn = 1.5, h = 4)
Ns |
number of data points on the Swiss roll. |
Nsph |
number of data points on the 3-sphere. |
a |
minimal radius of Swiss roll and radius of 3-sphere. |
b |
maximal radius of Swiss roll. |
nturn |
number of turns of the surface. |
h |
height of Swiss roll. |
swissRoll
returns three-dimensional data points.
swissRoll3Sph
returns four-dimensional data points with the Swiss roll
in the three first dimensions (columns). The Ns
first data points
lie on the Swiss roll and the Nsph
last data points lie on the
3-sphere.
Kerstin Johnsson, Lund University.
datap <- swissRoll3Sph(300, 100) par(mfrow = c(1, 3)) plot(datap[,1], datap[,2]) plot(datap[,1], datap[,3]) plot(datap[,1], datap[,4])
datap <- swissRoll3Sph(300, 100) par(mfrow = c(1, 3)) plot(datap[,1], datap[,2]) plot(datap[,1], datap[,3]) plot(datap[,1], datap[,4])
Estimates the intrinsic dimension of a data set using models of translated Poisson distributions.
maxLikGlobalDimEst(data, k, dnoise = NULL, sigma = 0, n = NULL, integral.approximation = 'Haro', unbiased = FALSE, neighborhood.based = TRUE, neighborhood.aggregation = 'maximum.likelihood', iterations = 5, K = 5) maxLikPointwiseDimEst(data, k, dnoise = NULL, sigma = 0, n = NULL, indices = NULL, integral.approximation = 'Haro', unbiased = FALSE, iterations = 5) maxLikLocalDimEst(data, dnoise = NULL, sigma = 0, n = NULL, integral.approximation = 'Haro', unbiased = FALSE, iterations = 5)
maxLikGlobalDimEst(data, k, dnoise = NULL, sigma = 0, n = NULL, integral.approximation = 'Haro', unbiased = FALSE, neighborhood.based = TRUE, neighborhood.aggregation = 'maximum.likelihood', iterations = 5, K = 5) maxLikPointwiseDimEst(data, k, dnoise = NULL, sigma = 0, n = NULL, indices = NULL, integral.approximation = 'Haro', unbiased = FALSE, iterations = 5) maxLikLocalDimEst(data, dnoise = NULL, sigma = 0, n = NULL, integral.approximation = 'Haro', unbiased = FALSE, iterations = 5)
data |
data set with each row describing a data point. |
k |
the number of distances that should be used for each dimension estimation. |
dnoise |
a function or a name of a function giving the translation density. If NULL, no noise is modeled, and the estimator turns into the Hill estimator (see References). Translation densities |
sigma |
(estimated) standard deviation of the (isotropic) noise. |
n |
dimension of the noise. |
indices |
the indices of the data points for which local dimension estimation should be made. |
integral.approximation |
how to approximate the integrals in eq. (5) in Haro et al. (2008). Possible values: |
unbiased |
if |
neighborhood.based |
if TRUE, dimension estimation is first made for neighborhoods around each data point and final value is aggregated from this. Otherwise dimension estimation is made once, based on distances in entire data set. |
neighborhood.aggregation |
if |
iterations |
for |
K |
for |
The estimators are based on the referenced paper by Haro et al. (2008), using the assumption that there is a single manifold. The estimator in the paper is obtained using default parameters and dnoise = dnoiseGaussH
.
With integral.approximation = 'Haro'
the Taylor expansion approximation of r^(m-1)
that Haro et al. (2008) used are employed. With integral.approximation = 'guaranteed.convergence'
, r
is factored out and kept and r^(m-2)
is approximated with the corresponding Taylor expansion. This guarantees convergence of the integrals. Divergence might be an issue when the noise is not sufficiently small in comparison to the smallest distances. With integral.approximation = 'iteration'
, five iterations is used to determine m
.
maxLikLocalDimEst
assumes that the data set is local i.e. a piece of a data set cut out by a sphere with a radius such that the data set is well approximated by a hyperplane (meaning that the curvature should be low in the local data set). See localIntrinsicDimension
.
For maxLikGlobalDimEst
and maxLikLocalDimEst
, a DimEst
object with one slot:
dim.est |
the dimension estimate |
For maxLikPointwiseDimEst
, a DimEstPointwise
object, inheriting data.frame
, with one slot:
dim.est |
the dimension estimate for each data point. Row |
Kerstin Johnsson, Lund University.
Haro, G., Randall, G. and Sapiro, G. (2008) Translated Poisson Mixture Model for Stratification Learning. Int. J. Comput. Vis., 80, 358-374.
Hill, B. M. (1975) A simple general approach to inference about the tail of a distribution. Ann. Stat., 3(5) 1163-1174.
Levina, E. and Bickel., P. J. (2005) Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems 17, 777-784. MIT Press.
data <- hyperBall(100, d = 7, n = 13, sd = 0.01) maxLikGlobalDimEst(data, 10, dnoiseNcChi, 0.01, 13) maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13) maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13) maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13, neighborhood.aggregation = 'robust') maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13, integral.approximation = 'guaranteed.convergence', neighborhood.aggregation = 'robust') maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13, integral.approximation = 'iteration', unbiased = TRUE) data <- hyperBall(1000, d = 7, n = 13, sd = 0.01) maxLikGlobalDimEst(data, 500, dnoiseGaussH, 0.01, 13, neighborhood.based = FALSE) maxLikGlobalDimEst(data, 500, dnoiseGaussH, 0.01, 13, integral.approximation = 'guaranteed.convergence', neighborhood.based = FALSE) maxLikGlobalDimEst(data, 500, dnoiseGaussH, 0.01, 13, integral.approximation = 'iteration', neighborhood.based = FALSE) data <- hyperBall(100, d = 7, n = 13, sd = 0.01) maxLikPointwiseDimEst(data, 10, dnoiseNcChi, 0.01, 13, indices=1:10) data <- cutHyperPlane(50, d = 7, n = 13, sd = 0.01) maxLikLocalDimEst(data, dnoiseNcChi, 0.1, 3) maxLikLocalDimEst(data, dnoiseGaussH, 0.1, 3) maxLikLocalDimEst(data, dnoiseNcChi, 0.1, 3, integral.approximation = 'guaranteed.convergence')
data <- hyperBall(100, d = 7, n = 13, sd = 0.01) maxLikGlobalDimEst(data, 10, dnoiseNcChi, 0.01, 13) maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13) maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13) maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13, neighborhood.aggregation = 'robust') maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13, integral.approximation = 'guaranteed.convergence', neighborhood.aggregation = 'robust') maxLikGlobalDimEst(data, 10, dnoiseGaussH, 0.01, 13, integral.approximation = 'iteration', unbiased = TRUE) data <- hyperBall(1000, d = 7, n = 13, sd = 0.01) maxLikGlobalDimEst(data, 500, dnoiseGaussH, 0.01, 13, neighborhood.based = FALSE) maxLikGlobalDimEst(data, 500, dnoiseGaussH, 0.01, 13, integral.approximation = 'guaranteed.convergence', neighborhood.based = FALSE) maxLikGlobalDimEst(data, 500, dnoiseGaussH, 0.01, 13, integral.approximation = 'iteration', neighborhood.based = FALSE) data <- hyperBall(100, d = 7, n = 13, sd = 0.01) maxLikPointwiseDimEst(data, 10, dnoiseNcChi, 0.01, 13, indices=1:10) data <- cutHyperPlane(50, d = 7, n = 13, sd = 0.01) maxLikLocalDimEst(data, dnoiseNcChi, 0.1, 3) maxLikLocalDimEst(data, dnoiseGaussH, 0.1, 3) maxLikLocalDimEst(data, dnoiseNcChi, 0.1, 3, integral.approximation = 'guaranteed.convergence')
Generates data points from a two- or higher-dimensional Twin Peaks manifold.
twinPeaks(Ns, h = 1) hyperTwinPeaks(Ns, n, h = 1)
twinPeaks(Ns, h = 1) hyperTwinPeaks(Ns, n, h = 1)
Ns |
number of data points. |
n |
dimension of the (hyper) plane from which the peaks stand out. For
|
h |
height of the peaks. |
The height of the points is computed as , where
are the coordinates of the point in
the (hyper) plane.
A n+1
-dimensional data set, where the last dimension represents the
height of the points.
Kerstin Johnsson, Lund University.
datap <- twinPeaks(400) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2])
datap <- twinPeaks(400) par(mfrow = c(1, 3)) plot(datap[,1], datap[,3]) plot(datap[,2], datap[,3]) plot(datap[,1], datap[,2])