Confidence-Optimal Random Embeddings
- URL: http://arxiv.org/abs/2104.05628v1
- Date: Tue, 6 Apr 2021 18:00:02 GMT
- Title: Confidence-Optimal Random Embeddings
- Authors: Maciej Skorski
- Abstract summary: This paper develops Johnson-Lindenstrauss distributions with optimal, data-oblivious, statistical confidence bounds.
The bounds are numerically best possible, for any given data dimension, embedding dimension, and distortion tolerance.
They improve upon prior works in terms of statistical accuracy, as well as exactly determine the no-go regimes for data-oblivious approaches.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The seminal result of Johnson and Lindenstrauss on random embeddings has been
intensively studied in applied and theoretical computer science. Despite that
vast body of literature, we still lack a complete understanding of the statistical
properties of random projections; a particularly intriguing question is: why
are the theoretical bounds so far behind the empirically observed
performance? Motivated by this question, this work develops
Johnson-Lindenstrauss distributions with optimal, data-oblivious, statistical
confidence bounds. These bounds are numerically best possible, for any given
data dimension, embedding dimension, and distortion tolerance. They improve
upon prior works in terms of statistical accuracy, as well as exactly determine
the no-go regimes for data-oblivious approaches. Furthermore, the corresponding
projection matrices are efficiently samplable. The construction relies on
orthogonal matrices, and the proof uses certain elegant properties of the unit
sphere. The following techniques introduced in this work are of independent
interest: a) a compact expression for distortion in terms of the singular
values of the projection matrix, b) a parametrization linking the unit
sphere and the Dirichlet distribution and c) anti-concentration bounds for the
Dirichlet distribution.
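Technique (b) can be sanity-checked numerically: if a point is drawn uniformly from the unit sphere, its squared coordinates are jointly Dirichlet(1/2, ..., 1/2) distributed, so in dimension n each squared coordinate has mean 1/n. A minimal sketch of this check (illustrative code, not taken from the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, samples = 3, 200_000

# Uniform points on the unit sphere: normalize standard Gaussian vectors.
g = rng.standard_normal((samples, n))
u = g / np.linalg.norm(g, axis=1, keepdims=True)

# Squared coordinates sum to 1 on each row and follow Dirichlet(1/2, ..., 1/2).
sq = u ** 2

# Compare empirical means with the Dirichlet mean 1/n.
print(sq.mean(axis=0))  # each entry close to 1/3
```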
Besides the technical contribution, the paper presents applications and
numerical evaluation along with working implementation in Python.
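The role of orthogonal matrices can be illustrated with a short sketch: take the first m rows of a random orthogonal matrix (from a QR decomposition of a Gaussian matrix) and rescale so squared lengths are preserved in expectation. This is a generic JL-style embedding for illustration, not the paper's exact optimal construction; the dimensions n and m are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 64  # data dimension and embedding dimension (placeholder values)

# A random orthogonal matrix from the QR decomposition of a Gaussian matrix.
# (For an exactly Haar-distributed Q one would also fix the signs of R's
# diagonal; for this illustration plain QR suffices.)
G = rng.standard_normal((n, n))
Q, _ = np.linalg.qr(G)

# Keep m rows and rescale so squared lengths are preserved in expectation.
P = np.sqrt(n / m) * Q[:m, :]

x = rng.standard_normal(n)
distortion = np.linalg.norm(P @ x) ** 2 / np.linalg.norm(x) ** 2
print(distortion)  # concentrates around 1 for moderate m
```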
Related papers
- Node Similarities under Random Projections: Limits and Pathological Cases [9.452274776651494]
We investigate how well dot product and cosine similarity are preserved by random projections.
We specialize our fundamental results to a ranking application by computing the probability of random projections flipping the node ordering induced by their embeddings.
With respect to the statistical noise introduced by random projections, we show that cosine similarity produces remarkably more precise approximations.
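The preservation of cosine similarity under random projections can be sketched with a dense Gaussian projection, one standard JL construction (illustrative code with placeholder dimensions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 500, 100  # original and projected dimensions (placeholder values)

x = rng.standard_normal(n)
y = x + 0.5 * rng.standard_normal(n)  # a correlated pair of vectors

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A dense Gaussian random projection with variance 1/m per entry.
R = rng.standard_normal((m, n)) / np.sqrt(m)
print(cosine(x, y), cosine(R @ x, R @ y))  # the two values are close
```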
arXiv Detail & Related papers (2024-04-15T21:35:25Z)
- Distributed Semi-Supervised Sparse Statistical Inference [6.685997976921953]
A debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters.
Traditional methods require computing a debiased estimator on every machine.
An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabeled data, is developed.
arXiv Detail & Related papers (2023-06-17T17:30:43Z)
- Learning Graphical Factor Models with Riemannian Optimization [70.13748170371889]
This paper proposes a flexible algorithmic framework for graph learning under low-rank structural constraints.
The problem is expressed as penalized maximum likelihood estimation of an elliptical distribution.
We leverage geometries of positive definite matrices and positive semi-definite matrices of fixed rank that are well suited to elliptical models.
arXiv Detail & Related papers (2022-10-21T13:19:45Z)
- Data-Driven Influence Functions for Optimization-Based Causal Inference [105.5385525290466]
We study a constructive algorithm that approximates Gateaux derivatives for statistical functionals by finite differencing.
We study the case where probability distributions are not known a priori but need to be estimated from data.
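Approximating a Gateaux derivative by finite differencing can be sketched on a toy statistical functional. The example below perturbs the empirical distribution toward a point mass and compares the finite difference against the classical influence function of the variance, (z - mean)^2 - var; the function names are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)

def variance_functional(weights, data):
    # Variance of the weighted empirical distribution.
    mu = np.average(data, weights=weights)
    return np.average((data - mu) ** 2, weights=weights)

def gateaux_fd(data, z, eps=1e-4):
    # Finite-difference Gateaux derivative toward a point mass at z:
    # (T((1 - eps) * P_n + eps * delta_z) - T(P_n)) / eps.
    n = len(data)
    base = variance_functional(np.full(n, 1.0 / n), data)
    aug_data = np.append(data, z)
    aug_w = np.append(np.full(n, (1.0 - eps) / n), eps)
    return (variance_functional(aug_w, aug_data) - base) / eps

z = 2.0
fd = gateaux_fd(x, z)
exact = (z - x.mean()) ** 2 - x.var()  # influence function of the variance
print(fd, exact)  # the two values nearly agree
```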
arXiv Detail & Related papers (2022-08-29T16:16:22Z)
- Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e., the empirical model error equals the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z)
- Efficient Multidimensional Functional Data Analysis Using Marginal Product Basis Systems [2.4554686192257424]
We propose a framework for learning continuous representations from a sample of multidimensional functional data.
We show that the resulting estimation problem can be solved efficiently by the tensor decomposition.
We conclude with a real data application in neuroimaging.
arXiv Detail & Related papers (2021-07-30T16:02:15Z)
- Random Embeddings with Optimal Accuracy [0.0]
This work constructs Johnson-Lindenstrauss embeddings with best accuracy, as measured by variance, mean-squared error and exponential length distortion.
arXiv Detail & Related papers (2020-12-31T19:00:31Z)
- General stochastic separation theorems with optimal bounds [68.8204255655161]
The phenomenon of separability was revealed and used in machine learning to correct errors of Artificial Intelligence (AI) systems and analyze AI instabilities.
Errors or clusters of errors can be separated from the rest of the data.
The ability to correct an AI system also opens up the possibility of an attack on it, and the high dimensionality induces vulnerabilities caused by the same separability.
arXiv Detail & Related papers (2020-10-11T13:12:41Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
- Semiparametric Nonlinear Bipartite Graph Representation Learning with Provable Guarantees [106.91654068632882]
We consider the bipartite graph and formalize its representation learning problem as a statistical estimation problem of parameters in a semiparametric exponential family distribution.
We show that the proposed objective is strongly convex in a neighborhood around the ground truth, so that a gradient descent-based method achieves linear convergence rate.
Our estimator is robust to any model misspecification within the exponential family, which is validated in extensive experiments.
arXiv Detail & Related papers (2020-03-02T16:40:36Z)
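The claim that strong convexity near the ground truth yields a linear convergence rate for gradient descent can be illustrated on a toy quadratic (a generic sketch, unrelated to the paper's specific objective):

```python
import numpy as np

# Gradient descent on f(w) = 0.5 * w.T @ A @ w, strongly convex with
# mu = 1 (smallest eigenvalue) and smoothness L = 10 (largest eigenvalue).
A = np.diag([1.0, 10.0])
L, mu = 10.0, 1.0
w = np.array([1.0, 1.0])

errors = []
for _ in range(50):
    w = w - (1.0 / L) * (A @ w)  # gradient step with step size 1/L
    errors.append(np.linalg.norm(w))

# Linear convergence: each step contracts the error by at least (1 - mu/L).
print(errors[-1])
```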
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.