Minimal Random Code Learning with Mean-KL Parameterization
- URL: http://arxiv.org/abs/2307.07816v2
- Date: Mon, 4 Dec 2023 10:08:57 GMT
- Title: Minimal Random Code Learning with Mean-KL Parameterization
- Authors: Jihao Andreas Lin, Gergely Flamich, José Miguel Hernández-Lobato
- Abstract summary: We study two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks.
MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior.
We show that variational training with Mean-KL parameterization converges twice as fast and maintains predictive performance after compression.
- Score: 2.3814052021083354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the qualitative behavior and robustness of two variants of
Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian
neural networks. MIRACLE implements a powerful, conditionally Gaussian
variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses
relative entropy coding to compress a weight sample from the posterior using a
Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired
compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must
be constrained, which requires a computationally expensive annealing procedure
under the conventional mean-variance (Mean-Var) parameterization for
$Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and KL
divergence from $P_{\mathbf{w}}$ to constrain the compression cost to the
desired value by construction. We demonstrate that variational training with
Mean-KL parameterization converges twice as fast and maintains predictive
performance after compression. Furthermore, we show that Mean-KL leads to more
meaningful variational distributions with heavier tails and compressed weight
samples which are more robust to pruning.
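To make the abstract's two steps concrete, here is an illustrative sketch of the relative entropy coding step that MIRACLE relies on, under the assumption of factorized Gaussians and shared randomness between encoder and decoder. The function names (`mrc_encode`, `mrc_decode`) and the fixed overhead constant are assumptions for illustration, not the authors' implementation; in practice the weights would be partitioned into blocks with a small per-block KL budget, since the number of candidates grows exponentially with the KL.

```python
import numpy as np

def mrc_encode(q_logpdf, p_sample, p_logpdf, kl_nats, seed, overhead_nats=4.0):
    """Minimal-random-coding sketch: communicate an approximate sample from Q
    using only the coding distribution P and a seed shared with the decoder."""
    K = int(np.ceil(np.exp(kl_nats + overhead_nats)))    # number of candidates
    rng = np.random.default_rng(seed)                    # shared randomness
    candidates = p_sample(rng, K)                        # K candidate weight blocks drawn from P
    log_w = q_logpdf(candidates) - p_logpdf(candidates)  # importance weights Q / P (in log space)
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    index = rng.choice(K, p=probs)   # only this index (roughly kl_nats nats) is transmitted
    return index, candidates[index]

def mrc_decode(index, p_sample, seed, K):
    """The decoder regenerates the same K candidates from the shared seed and picks the index."""
    rng = np.random.default_rng(seed)
    return p_sample(rng, K)[index]
```

The Mean-KL parameterization itself can be illustrated for a single factorized Gaussian weight: instead of optimizing (mean, variance) and annealing a KL penalty, one optimizes (mean, per-weight KL budget) and recovers the variance that meets the budget exactly, since the closed-form Gaussian KL reduces this to one-dimensional root finding. The following is only a sketch under these assumptions; the paper's exact construction, and how gradients are propagated through it, may differ.

```python
import numpy as np
from scipy.optimize import brentq

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] for scalar Gaussians."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2) - 0.5)

def sigma_q_for_target_kl(mu_q, mu_p, sigma_p, kl_target):
    """Given the posterior mean and a per-weight KL budget, solve for the posterior
    std so that KL[Q || P] equals kl_target by construction.

    With r = sigma_q^2 / sigma_p^2 and d = (mu_q - mu_p)^2 / sigma_p^2, the Gaussian
    KL is 0.5 * (r - log r + d - 1), so we need r - log r = 2 * kl_target - d + 1.
    We take the root with r <= 1 (posterior narrower than the coding distribution),
    which exists whenever kl_target >= d / 2.
    """
    c = 2.0 * kl_target - (mu_q - mu_p) ** 2 / sigma_p ** 2 + 1.0
    if c < 1.0:
        raise ValueError("kl_target is below the minimum achievable KL for this mean offset")
    return sigma_p * np.sqrt(brentq(lambda r: r - np.log(r) - c, 1e-12, 1.0))

# Example: a 0.1-nat per-weight budget against a standard normal coding distribution.
sigma_q = sigma_q_for_target_kl(mu_q=0.3, mu_p=0.0, sigma_p=1.0, kl_target=0.1)
print(sigma_q, gaussian_kl(0.3, sigma_q, 0.0, 1.0))  # printed KL should be ~0.1
```

By contrast, the Mean-Var parameterization treats the variance as a free parameter and pushes $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ toward the budget with an annealed penalty, which is the computationally expensive procedure the abstract refers to.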
Related papers
- Variance Reduction for the Independent Metropolis Sampler [11.074080383657453]
We prove that if $\pi$ is close enough under KL divergence to another density $q$, an independent sampler that obtains samples from $\pi$ achieves smaller variance than i.i.d. sampling from $\pi$.
We propose an adaptive independent Metropolis algorithm that adapts the proposal density so that its KL divergence from the target is progressively reduced.
arXiv Detail & Related papers (2024-06-25T16:38:53Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum [7.095058159492494]
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent.
We show that SHB can attain an accelerated rate of convergence when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$.
arXiv Detail & Related papers (2024-01-12T18:17:28Z)
- Compressed and distributed least-squares regression: convergence rates with applications to Federated Learning [9.31522898261934]
We investigate the impact of compression on gradient algorithms for machine learning.
We highlight differences in terms of convergence rates between several unbiased compression operators.
We extend our results to the case of federated learning.
arXiv Detail & Related papers (2023-08-02T18:02:00Z)
- Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation [41.082982732100696]
We study a minibatch variant of stochastic proximal point (SPP) methods, namely M-SPP, for solving convex composite risk minimization problems.
We show that M-SPP with minibatch size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence.
In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches.
arXiv Detail & Related papers (2023-01-09T00:13:34Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find that both inter-correlations and intra-correlations exist among latent variables when they are observed from a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed-up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- Permutation Compressors for Provably Faster Distributed Nonconvex Optimization [68.8204255655161]
We show that the MARINA method of Gorbunov et al. (2021) can be considered as a state-of-the-art method in terms of theoretical communication complexity.
We extend the theory of MARINA to support potentially correlated compressors, going beyond the classical independent compressors setting.
arXiv Detail & Related papers (2021-10-07T09:38:15Z)
- Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
- On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)