Minimal Random Code Learning with Mean-KL Parameterization
- URL: http://arxiv.org/abs/2307.07816v2
- Date: Mon, 4 Dec 2023 10:08:57 GMT
- Title: Minimal Random Code Learning with Mean-KL Parameterization
- Authors: Jihao Andreas Lin, Gergely Flamich, José Miguel Hernández-Lobato
- Abstract summary: We study two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks.
MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior.
We show that variational training with Mean-KL parameterization converges twice as fast and maintains predictive performance after compression.
- Score: 2.3814052021083354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the qualitative behavior and robustness of two variants of
Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian
neural networks. MIRACLE implements a powerful, conditionally Gaussian
variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses
relative entropy coding to compress a weight sample from the posterior using a
Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired
compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must
be constrained, which requires a computationally expensive annealing procedure
under the conventional mean-variance (Mean-Var) parameterization for
$Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and KL
divergence from $P_{\mathbf{w}}$ to constrain the compression cost to the
desired value by construction. We demonstrate that variational training with
Mean-KL parameterization converges twice as fast and maintains predictive
performance after compression. Furthermore, we show that Mean-KL leads to more
meaningful variational distributions with heavier tails and compressed weight
samples which are more robust to pruning.
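To make the abstract's two steps concrete, here is an illustrative sketch of the relative entropy coding step that MIRACLE relies on, under the assumption of factorized Gaussians and shared randomness between encoder and decoder. The function names (`mrc_encode`, `mrc_decode`) and the fixed overhead constant are assumptions for illustration, not the authors' implementation; in practice the weights would be partitioned into blocks with a small per-block KL budget, since the number of candidates grows exponentially with the KL.

```python
import numpy as np

def mrc_encode(q_logpdf, p_sample, p_logpdf, kl_nats, seed, overhead_nats=4.0):
    """Minimal-random-coding sketch: communicate an approximate sample from Q
    using only the coding distribution P and a seed shared with the decoder."""
    K = int(np.ceil(np.exp(kl_nats + overhead_nats)))    # number of candidates
    rng = np.random.default_rng(seed)                    # shared randomness
    candidates = p_sample(rng, K)                        # K candidate weight blocks drawn from P
    log_w = q_logpdf(candidates) - p_logpdf(candidates)  # importance weights Q / P (in log space)
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    index = rng.choice(K, p=probs)   # only this index (roughly kl_nats nats) is transmitted
    return index, candidates[index]

def mrc_decode(index, p_sample, seed, K):
    """The decoder regenerates the same K candidates from the shared seed and picks the index."""
    rng = np.random.default_rng(seed)
    return p_sample(rng, K)[index]
```

The Mean-KL parameterization itself can be illustrated for a single factorized Gaussian weight: instead of optimizing (mean, variance) and annealing a KL penalty, one optimizes (mean, per-weight KL budget) and recovers the variance that meets the budget exactly, since the closed-form Gaussian KL reduces this to one-dimensional root finding. The following is only a sketch under these assumptions; the paper's exact construction, and how gradients are propagated through it, may differ.

```python
import numpy as np
from scipy.optimize import brentq

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] for scalar Gaussians."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2) - 0.5)

def sigma_q_for_target_kl(mu_q, mu_p, sigma_p, kl_target):
    """Given the posterior mean and a per-weight KL budget, solve for the posterior
    std so that KL[Q || P] equals kl_target by construction.

    With r = sigma_q^2 / sigma_p^2 and d = (mu_q - mu_p)^2 / sigma_p^2, the Gaussian
    KL is 0.5 * (r - log r + d - 1), so we need r - log r = 2 * kl_target - d + 1.
    We take the root with r <= 1 (posterior narrower than the coding distribution),
    which exists whenever kl_target >= d / 2.
    """
    c = 2.0 * kl_target - (mu_q - mu_p) ** 2 / sigma_p ** 2 + 1.0
    if c < 1.0:
        raise ValueError("kl_target is below the minimum achievable KL for this mean offset")
    return sigma_p * np.sqrt(brentq(lambda r: r - np.log(r) - c, 1e-12, 1.0))

# Example: a 0.1-nat per-weight budget against a standard normal coding distribution.
sigma_q = sigma_q_for_target_kl(mu_q=0.3, mu_p=0.0, sigma_p=1.0, kl_target=0.1)
print(sigma_q, gaussian_kl(0.3, sigma_q, 0.0, 1.0))  # printed KL should be ~0.1
```

By contrast, the Mean-Var parameterization treats the variance as a free parameter and pushes $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ toward the budget with an annealed penalty, which is the computationally expensive procedure the abstract refers to.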
Related papers
- Variance Reduction for the Independent Metropolis Sampler [11.074080383657453]
We prove that if $\pi$ is close enough under KL divergence to another density $q$, an independent sampler that obtains samples from $\pi$ achieves smaller variance than i.i.d. sampling from $\pi$.
We propose an adaptive independent Metropolis algorithm that adapts the proposal density so that its KL divergence from the target is progressively reduced.
arXiv Detail & Related papers (2024-06-25T16:38:53Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum [7.095058159492494]
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent.
We show that SHB can attain an accelerated rate of convergence when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$.
arXiv Detail & Related papers (2024-01-12T18:17:28Z)
- Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning [9.31522898261934]
We investigate the impact of compression on gradient algorithms for machine learning.
We highlight differences in terms of convergence rates between several unbiased compression operators.
We extend our results to the case of federated learning.
arXiv Detail & Related papers (2023-08-02T18:02:00Z)
- Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation [41.082982732100696]
We study a minibatch variant of stochastic proximal point (SPP) methods, namely M-SPP, for solving convex composite risk minimization problems.
We show that M-SPP with minibatch size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence.
In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches.
arXiv Detail & Related papers (2023-01-09T00:13:34Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find that both inter-correlations and intra-correlations exist among latent variables when they are observed from a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed-up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- Permutation Compressors for Provably Faster Distributed Nonconvex Optimization [68.8204255655161]
We show that the MARINA method of Gorbunov et al. (2021) can be considered as a state-of-the-art method in terms of theoretical communication complexity.
We extend the theory of MARINA to support potentially correlated compressors, going beyond the classical independent compressors setting.
arXiv Detail & Related papers (2021-10-07T09:38:15Z)
- Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
- On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)