Related papers: Posterior Approximation using Stochastic Gradient Ascent with Adaptive Stepsize

Posterior Approximation using Stochastic Gradient Ascent with Adaptive Stepsize

URL: http://arxiv.org/abs/2412.08951v2
Date: Sat, 22 Feb 2025 03:49:24 GMT
Title: Posterior Approximation using Stochastic Gradient Ascent with Adaptive Stepsize
Authors: Kart-Leong Lim, Xudong Jiang,
Abstract summary: posterior approximation allow nonparametrics such as Dirichlet process mixture to scale up to larger dataset at fractional cost.<n> gradient ascent is a modern approach to machine learning and is widely deployed in the training of deep neural networks.<n>In this work, we explore using gradient ascent as a fast algorithm for the posterior approximation of Dirichlet process mixture.
Score: 24.464140786923476
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scalable algorithms of posterior approximation allow Bayesian nonparametrics such as Dirichlet process mixture to scale up to larger dataset at fractional cost. Recent algorithms, notably the stochastic variational inference performs local learning from minibatch. The main problem with stochastic variational inference is that it relies on closed form solution. Stochastic gradient ascent is a modern approach to machine learning and is widely deployed in the training of deep neural networks. In this work, we explore using stochastic gradient ascent as a fast algorithm for the posterior approximation of Dirichlet process mixture. However, stochastic gradient ascent alone is not optimal for learning. In order to achieve both speed and performance, we turn our focus to stepsize optimization in stochastic gradient ascent. As as intermediate approach, we first optimize stepsize using the momentum method. Finally, we introduce Fisher information to allow adaptive stepsize in our posterior approximation. In the experiments, we justify that our approach using stochastic gradient ascent do not sacrifice performance for speed when compared to closed form coordinate ascent learning on these datasets. Lastly, our approach is also compatible with deep ConvNet features as well as scalable to large class datasets such as Caltech256 and SUN397.

Related papers

Adaptive Consensus Gradients Aggregation for Scaled Distributed Training [6.234802839923543]
We analyze the distributed gradient aggregation process through the lens of subspace optimization. Our method demonstrates improved performance over the ubiquitous averaging on multiple tasks while remaining extremely efficient in both communicational and computational complexity.
arXiv Detail & Related papers (2024-11-06T08:16:39Z)
Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement [29.675650285351768]
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks. Approximate MU is a practical method for large-scale models. We propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction.
arXiv Detail & Related papers (2024-09-29T15:17:33Z)
Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling [22.256068524699472]
In this work, we propose an Annealed Importance Sampling (AIS) approach to address these issues. We combine the strengths of Sequential Monte Carlo samplers and VI to explore a wider range of posterior distributions and gradually approach the target distribution. Experimental results on both toy and image datasets demonstrate that our method outperforms state-of-the-art methods in terms of tighter variational bounds, higher log-likelihoods, and more robust convergence.
arXiv Detail & Related papers (2024-08-13T08:09:05Z)
Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective. We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices. Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
Information-Theoretic Trust Regions for Stochastic Gradient-Based Optimization [17.79206971486723]
We show that arTuRO combines the fast convergence of adaptive moment-based optimization with the capabilities of SGD. We approximate the diagonal elements of the Hessian from gradients, constructing a model of the expected Hessian over time using only first-order information. We show that arTuRO combines the fast convergence of adaptive moment-based optimization with the capabilities of SGD.
arXiv Detail & Related papers (2023-10-31T16:08:38Z)
PAVI: Plate-Amortized Variational Inference [55.975832957404556]
Inference is challenging for large population studies where millions of measurements are performed over a cohort of hundreds of subjects. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical. In this work, we design structured VI families that efficiently tackle large population studies.
arXiv Detail & Related papers (2023-08-30T13:22:20Z)
Manifold Gaussian Variational Bayes on the Precision Matrix [70.44024861252554]
We propose an optimization algorithm for Variational Inference (VI) in complex models. We develop an efficient algorithm for Gaussian Variational Inference whose updates satisfy the positive definite constraint on the variational covariance matrix. Due to its black-box nature, MGVBP stands as a ready-to-use solution for VI in complex models.
arXiv Detail & Related papers (2022-10-26T10:12:31Z)
Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks. We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
Rényi Divergence Deep Mutual Learning [3.682680183777648]
This paper revisits Deep Learning Mutual (DML) as a simple yet effective computing paradigm. We propose using R'enyi divergence instead of the KL divergence, which is more flexible and limited. Our empirical results demonstrate the advantage combining DML and R'enyi divergence, leading to further improvement in model generalization.
arXiv Detail & Related papers (2022-09-13T04:58:35Z)
Generalised Gaussian Process Latent Variable Models (GPLVM) with Stochastic Variational Inference [9.468270453795409]
We study the doubly formulation of the BayesianVM model amenable with minibatch training. We show how this framework is compatible with different latent variable formulations and perform experiments to compare a suite of models. We demonstrate how we can train in the presence of massively missing data and obtain high-fidelity reconstructions.
arXiv Detail & Related papers (2022-02-25T21:21:51Z)
Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size [29.15132344744801]
We establish local convergence for gradient descent with adaptive step size for problems such as matrix inversion. We show that these first order optimization methods can achieve sub-linear or linear convergence.
arXiv Detail & Related papers (2021-12-30T00:50:30Z)
Simple Stochastic and Online Gradient DescentAlgorithms for Pairwise Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair instances. Online descent (OGD) is a popular approach to handle streaming data in pairwise learning. In this paper, we propose simple and online descent to methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
Learning Augmentation Distributions using Transformed Risk Minimization [47.236227685707526]
We propose a new emphTransformed Risk Minimization (TRM) framework as an extension of classical risk minimization. As a key application, we focus on learning augmentations to improve classification performance with a given class of predictors.
arXiv Detail & Related papers (2021-11-16T02:07:20Z)
Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation. Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective. We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
Stochastic Gradient Variance Reduction by Solving a Filtering Problem [0.951828574518325]
Deep neural networks (DNN) are typically using optimized gradient descent (SGD) The estimation of the gradient using samples tends to be noisy and unreliable, resulting in large gradient variance and bad convergence. We propose textbfFilter Gradient Decent(FGD), an efficient optimization algorithm that makes the consistent estimation of gradient.
arXiv Detail & Related papers (2020-12-22T23:48:42Z)
Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics. We prove that our model-based procedure converges in noisy gradient setting. This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and or binary weights the training by gradient descent is complicated. We propose a new method for this estimation problem combining sampling and analytic approximation steps. We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning [32.59760685342343]
Probabilistic Latent Variable Models provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. In this work, we propose ConvDMM, a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods.
arXiv Detail & Related papers (2020-06-03T21:50:20Z)
Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks. In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems. Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.