The Shrinkage-Delinkage Trade-off: An Analysis of Factorized Gaussian
Approximations for Variational Inference
- URL: http://arxiv.org/abs/2302.09163v3
- Date: Fri, 26 May 2023 17:58:48 GMT
- Title: The Shrinkage-Delinkage Trade-off: An Analysis of Factorized Gaussian
Approximations for Variational Inference
- Authors: Charles C. Margossian and Lawrence K. Saul
- Abstract summary: We consider two popular ways to measure the uncertainty deficit of variational inference (VI).
We prove that $q$ always underestimates both the componentwise variance and the entropy of $p$.
We study various manifestations of this trade-off, notably one where, as the dimension of the problem grows, the per-component entropy gap becomes vanishingly small.
- Score: 3.167685495996986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When factorized approximations are used for variational inference (VI), they
tend to underestimate the uncertainty -- as measured in various ways -- of the
distributions they are meant to approximate. We consider two popular ways to
measure the uncertainty deficit of VI: (i) the degree to which it
underestimates the componentwise variance, and (ii) the degree to which it
underestimates the entropy. To better understand these effects, and the
relationship between them, we examine an informative setting where they can be
explicitly (and elegantly) analyzed: the approximation of a Gaussian, $p$, with
a dense covariance matrix, by a Gaussian, $q$, with a diagonal covariance
matrix. We prove that $q$ always underestimates both the componentwise variance
and the entropy of $p$, though not necessarily to the same degree.
Moreover we demonstrate that the entropy of $q$ is determined by the trade-off
of two competing forces: it is decreased by the shrinkage of its componentwise
variances (our first measure of uncertainty) but it is increased by the
factorized approximation which delinks the nodes in the graphical model of $p$.
We study various manifestations of this trade-off, notably one where, as the
dimension of the problem grows, the per-component entropy gap between $p$ and
$q$ becomes vanishingly small even though $q$ underestimates every
componentwise variance by a constant multiplicative factor. We also use the
shrinkage-delinkage trade-off to bound the entropy gap in terms of the problem
dimension and the condition number of the correlation matrix of $p$. Finally we
present empirical results on both Gaussian and non-Gaussian targets, the former
to validate our analysis and the latter to explore its limitations.
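For the Gaussian-on-Gaussian setting analyzed here there is a well-known closed form: the factorized Gaussian $q$ that minimizes $KL(q\|p)$ matches the mean of $p$ and sets each variance to the reciprocal of the corresponding diagonal entry of the precision matrix $\Sigma^{-1}$. The NumPy sketch below is an illustration of that setting, not the authors' code; the covariance construction and the dimension are arbitrary choices. It checks the two uncertainty deficits numerically: every componentwise variance shrinks, and the entropy of $q$ falls below that of $p$.
```python
# A minimal sketch (not the authors' code): approximating a dense-covariance
# Gaussian p by the KL(q||p)-optimal factorized Gaussian q, whose variances are
# the reciprocals of the diagonal of the precision matrix of p.
import numpy as np

rng = np.random.default_rng(0)
d = 50

# Build a random dense covariance with nontrivial correlations.
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)          # SPD covariance of p
Lambda = np.linalg.inv(Sigma)            # precision matrix of p

# Reverse-KL-optimal factorized Gaussian: component i has variance 1 / Lambda_ii.
var_q = 1.0 / np.diag(Lambda)
var_p = np.diag(Sigma)

# (i) Componentwise variance shrinkage: var_q <= var_p for every component.
assert np.all(var_q <= var_p + 1e-12)
print("mean variance ratio var_q / var_p :", np.mean(var_q / var_p))

# Gaussian entropies: H = 0.5 * log det(2*pi*e * Cov).
_, logdet_p = np.linalg.slogdet(Sigma)
H_p = 0.5 * (d * np.log(2 * np.pi * np.e) + logdet_p)
H_q = 0.5 * (d * np.log(2 * np.pi * np.e) + np.sum(np.log(var_q)))

# (ii) Entropy deficit: H_q <= H_p; the per-component gap can stay small even
# when every variance is shrunk by a roughly constant factor.
print("entropy gap (H_p - H_q)           :", H_p - H_q)
print("per-component entropy gap         :", (H_p - H_q) / d)
```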
Related papers
- Causal vs. Anticausal merging of predictors [57.26526031579287]
We study the differences arising from merging predictors in the causal and anticausal directions using the same data.
We use Causal Maximum Entropy (CMAXENT) as inductive bias to merge the predictors.
arXiv Detail & Related papers (2025-01-14T20:38:15Z) - A Unified Analysis for Finite Weight Averaging [50.75116992029417]
Averaging iterations of Stochastic Gradient Descent (SGD) has achieved empirical success in training deep learning models, through schemes such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA).
In this paper, we generalize LAWA as Finite Weight Averaging (FWA) and explain their advantages compared to SGD from the perspective of optimization and generalization.
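As a concrete point of reference, the averaging schemes named above can be sketched in a few lines; the toy quadratic objective, learning rate, EMA decay, and window size k below are illustrative assumptions rather than settings from the paper.
```python
# A schematic comparison (an illustration, not the paper's implementation) of
# SWA, EMA, and FWA/LAWA-style averaging applied to plain SGD iterates.
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(2)
lr, steps, k, ema_decay = 0.1, 200, 10, 0.99

ema = w.copy()
history = []
for t in range(steps):
    # Noisy gradient of a quadratic centered at (3, -1).
    grad = 2 * (w - np.array([3.0, -1.0])) + 0.5 * rng.standard_normal(2)
    w = w - lr * grad                              # SGD step
    ema = ema_decay * ema + (1 - ema_decay) * w    # EMA of the weights
    history.append(w.copy())

swa = np.mean(history, axis=0)       # SWA: average of all iterates
fwa = np.mean(history[-k:], axis=0)  # FWA / LAWA-style: average of the latest k iterates

print("last SGD iterate:", w)
print("SWA:", swa, " EMA:", ema, " FWA:", fwa)
```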
arXiv Detail & Related papers (2024-11-20T10:08:22Z) - Convergence of Unadjusted Langevin in High Dimensions: Delocalization of Bias [13.642712817536072]
We show that as the dimension $d$ of the problem increases, the number of iterations required to ensure convergence within a desired error increases.
A key technical challenge we address is the lack of a one-step contraction property in the $W_{2,\ell^\infty}$ metric to measure convergence.
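For context, the unadjusted Langevin algorithm analyzed in that paper iterates $x_{k+1} = x_k + \eta \nabla \log p(x_k) + \sqrt{2\eta}\,\xi_k$. A minimal sketch on a standard Gaussian target follows; the step size and iteration count are illustrative choices, not values from the paper.
```python
# A minimal sketch of the unadjusted Langevin algorithm (ULA) on a
# d-dimensional standard Gaussian target.
import numpy as np

rng = np.random.default_rng(2)
d, step, n_iter = 100, 0.05, 2000

def grad_log_p(x):
    # Target: standard Gaussian, so grad log p(x) = -x.
    return -x

x = rng.standard_normal(d)
for _ in range(n_iter):
    noise = rng.standard_normal(d)
    x = x + step * grad_log_p(x) + np.sqrt(2 * step) * noise  # ULA update

print("coordinate mean of the final iterate:", x.mean(), "(target mean is 0)")
```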
arXiv Detail & Related papers (2024-08-20T01:24:54Z) - Theoretical Guarantees for Variational Inference with Fixed-Variance Mixture of Gaussians [27.20127082606962]
Variational inference (VI) is a popular approach in Bayesian inference.
This work aims to contribute to the theoretical study of VI in the non-Gaussian case.
arXiv Detail & Related papers (2024-06-06T12:38:59Z) - Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs [10.075911116030621]
We show that if $p$ does not factorize, then any factorized approximation $q \in Q$ can correctly estimate at most one of three key measures of uncertainty.
We consider the classic Kullback-Leibler divergences, the more general Rényi divergences, and a score-based divergence which compares $\nabla \log p$ and $\nabla \log q$.
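As a rough illustration of the quantities involved, the sketch below evaluates the closed-form $KL(q\|p)$ between a dense Gaussian $p$ and a factorized Gaussian $q$, together with a Monte-Carlo estimate of a score-based divergence of the form $E_q\|\nabla \log p - \nabla \log q\|^2$; the specific matrices are arbitrary and this is not the paper's code.
```python
# A hedged illustration: KL(q||p) in closed form and a Monte-Carlo score-based
# divergence, for a dense zero-mean Gaussian p and a factorized Gaussian q.
import numpy as np

rng = np.random.default_rng(3)
d = 5
A = rng.standard_normal((d, d))
Sigma_p = A @ A.T + d * np.eye(d)     # dense covariance of p (mean 0)
Prec_p = np.linalg.inv(Sigma_p)
var_q = 1.0 / np.diag(Prec_p)         # KL(q||p)-optimal factorized variances

# Closed-form KL(q||p) for zero-mean Gaussians.
_, logdet_p = np.linalg.slogdet(Sigma_p)
kl = 0.5 * (np.trace(Prec_p @ np.diag(var_q)) - d + logdet_p - np.sum(np.log(var_q)))

# Monte-Carlo score-based divergence under samples from q.
x = rng.standard_normal((10_000, d)) * np.sqrt(var_q)
score_p = -x @ Prec_p.T               # grad log p(x) = -Prec_p @ x
score_q = -x / var_q                  # grad log q(x) = -x / var_q (diagonal)
fisher = np.mean(np.sum((score_p - score_q) ** 2, axis=1))

print("KL(q || p)             :", kl)
print("score-based divergence :", fisher)
```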
arXiv Detail & Related papers (2024-03-20T16:56:08Z) - TIC-TAC: A Framework for Improved Covariance Estimation in Deep Heteroscedastic Regression [109.69084997173196]
Deep heteroscedastic regression involves jointly optimizing the mean and covariance of the predicted distribution using the negative log-likelihood.
Recent works show that this may result in sub-optimal convergence due to the challenges associated with covariance estimation.
We study two questions, the first of which asks whether the predicted covariance truly captures the randomness of the predicted mean.
Our results show that not only does TIC accurately learn the covariance, it additionally facilitates an improved convergence of the negative log-likelihood.
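A generic sketch of the negative-log-likelihood objective used in deep heteroscedastic regression is given below; it uses a diagonal covariance for simplicity and is not the TIC-TAC method itself. The module, feature dimension, and tensor shapes are placeholder assumptions.
```python
# A generic sketch of jointly predicting a mean and a (diagonal) variance and
# training both via the Gaussian negative log-likelihood.
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mean = nn.Linear(in_dim, out_dim)
        self.log_var = nn.Linear(in_dim, out_dim)  # predict log-variance for positivity

    def forward(self, h):
        return self.mean(h), self.log_var(h).exp()

head = HeteroscedasticHead(16, 1)
h = torch.randn(32, 16)   # features from some backbone (placeholder)
y = torch.randn(32, 1)    # regression targets (placeholder)

mean, var = head(h)
loss = nn.GaussianNLLLoss()(mean, y, var)  # joint mean/variance NLL
loss.backward()
print("NLL:", loss.item())
```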
arXiv Detail & Related papers (2023-10-29T09:54:03Z) - Compressed and distributed least-squares regression: convergence rates
with applications to Federated Learning [9.31522898261934]
We investigate the impact of compression on gradient algorithms for machine learning.
We highlight differences in terms of convergence rates between several unbiased compression operators.
We extend our results to the case of federated learning.
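To make the setting concrete, here is a toy sketch of compressed gradient descent on a least-squares problem, using an unbiased rand-$k$ sparsifier as the compression operator; the operator choice, step size, and problem sizes are illustrative assumptions, not those studied in the paper.
```python
# A toy sketch: least-squares gradient descent with an unbiased rand-k
# compression operator applied to each gradient.
import numpy as np

rng = np.random.default_rng(4)
n, d, k, lr, steps = 200, 20, 5, 0.05, 1000

X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def rand_k(g, k):
    # Unbiased sparsification: keep k random coordinates, rescale by d/k.
    idx = rng.choice(g.size, size=k, replace=False)
    out = np.zeros_like(g)
    out[idx] = g[idx] * (g.size / k)
    return out

w = np.zeros(d)
for _ in range(steps):
    grad = X.T @ (X @ w - y) / n   # least-squares gradient
    w = w - lr * rand_k(grad, k)   # compressed gradient step

print("parameter error:", np.linalg.norm(w - w_true))
```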
arXiv Detail & Related papers (2023-08-02T18:02:00Z) - Universality in Anderson localization on random graphs with varying
connectivity [0.0]
We show that there should be a non-ergodic region above a given value of disorder $W_E$.
Although no $W_E$ separate from $W_C$ exists, the length scale at which fully developed ergodicity is found diverges like $|W-W_C|^{-1}$.
The separation of these two scales at the critical point allows for a true non-ergodic, delocalized region.
arXiv Detail & Related papers (2022-05-29T09:47:39Z) - Equivariance Discovery by Learned Parameter-Sharing [153.41877129746223]
We study how to discover interpretable equivariances from data.
Specifically, we formulate this discovery process as an optimization problem over a model's parameter-sharing schemes.
Also, we theoretically analyze the method for Gaussian data and provide a bound on the mean squared gap between the studied discovery scheme and the oracle scheme.
arXiv Detail & Related papers (2022-04-07T17:59:19Z) - Fundamental Limits and Tradeoffs in Invariant Representation Learning [99.2368462915979]
Many machine learning applications involve learning representations that achieve two competing goals.
Minimax game-theoretic formulation represents a fundamental tradeoff between accuracy and invariance.
We provide an information-theoretic analysis of this general and important problem under both classification and regression settings.
arXiv Detail & Related papers (2020-12-19T15:24:04Z) - Robustly Learning any Clusterable Mixture of Gaussians [55.41573600814391]
We study the efficient learnability of high-dimensional Gaussian mixtures in the adversarial-robust setting.
We provide an algorithm that learns the components of an $\epsilon$-corrupted $k$-mixture within information-theoretically near-optimal error of $\tilde{O}(\epsilon)$.
Our main technical contribution is a new robust identifiability proof for clusters in a Gaussian mixture, which can be captured by the constant-degree Sum-of-Squares proof system.
arXiv Detail & Related papers (2020-05-13T16:44:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.