Improving Discrete Optimisation Via Decoupled Straight-Through Gumbel-Softmax
- URL: http://arxiv.org/abs/2410.13331v1
- Date: Thu, 17 Oct 2024 08:44:57 GMT
- Title: Improving Discrete Optimisation Via Decoupled Straight-Through Gumbel-Softmax
- Authors: Rushi Shah, Mingyuan Yan, Michael Curtis Mozer, Dianbo Liu,
- Abstract summary: We show that our approach significantly enhances the original ST-GS through extensive experiments across multiple tasks and datasets.
Our findings contribute to the ongoing effort to improve discrete optimization in deep learning.
- Score: 4.427325225595673
- License:
- Abstract: Discrete representations play a crucial role in many deep learning architectures, yet their non-differentiable nature poses significant challenges for gradient-based optimization. To address this issue, various gradient estimators have been developed, including the Straight-Through Gumbel-Softmax (ST-GS) estimator, which combines the Straight-Through Estimator (STE) and the Gumbel-based reparameterization trick. However, the performance of ST-GS is highly sensitive to temperature, with its selection often compromising gradient fidelity. In this work, we propose a simple yet effective extension to ST-GS by employing decoupled temperatures for forward and backward passes, which we refer to as "Decoupled ST-GS". We show that our approach significantly enhances the original ST-GS through extensive experiments across multiple tasks and datasets. We further investigate the impact of our method on gradient fidelity from multiple perspectives, including the gradient gap and the bias-variance trade-off of estimated gradients. Our findings contribute to the ongoing effort to improve discrete optimization in deep learning, offering a practical solution that balances simplicity and effectiveness.
Related papers
- Bayesian Parameter Shift Rule in Variational Quantum Eigensolvers [4.431744869863552]
In this paper, we propose its Bayesian variant, where Gaussian processes with appropriate kernels are used to estimate the gradient of the VQE objective.
In gradient descent (SGD), the flexibility of Bayesian PSR allows the reuse of observations in previous steps, which accelerates the optimization process.
Our numerical experiments show that the VQE optimization with Bayesian PSR and GradCoRe significantly accelerates SGD and outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2025-02-04T14:44:31Z) - On the Convergence of DP-SGD with Adaptive Clipping [56.24689348875711]
Gradient Descent with gradient clipping is a powerful technique for enabling differentially private optimization.
This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD)
We show how QC-SGD suffers from a bias problem similar to constant-threshold clipped SGD but can be mitigated through a carefully designed quantile and step size schedule.
arXiv Detail & Related papers (2024-12-27T20:29:47Z) - Fast and Slow Gradient Approximation for Binary Neural Network Optimization [11.064044986709733]
hypernetwork based methods utilize neural networks to learn the gradients of non-differentiable quantization functions.
We propose a Historical Gradient Storage (HGS) module, which models the historical gradient sequence to generate the first-order momentum required for optimization.
We also introduce Layer Recognition Embeddings (LRE) into the hypernetwork, facilitating the generation of layer-specific fine gradients.
arXiv Detail & Related papers (2024-12-16T13:48:40Z) - Enhancing DP-SGD through Non-monotonous Adaptive Scaling Gradient Weight [15.139854970044075]
We introduce Differentially Private Per-sample Adaptive Scaling Clipping (DP-PSASC)
This approach replaces traditional clipping with non-monotonous adaptive gradient scaling.
Our theoretical and empirical analyses confirm that DP-PSASC preserves gradient privacy and delivers superior performance across diverse datasets.
arXiv Detail & Related papers (2024-11-05T12:47:30Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZOZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster results compared to standard ZOZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Enhancing Generalization of Universal Adversarial Perturbation through
Gradient Aggregation [40.18851174642427]
Deep neural networks are vulnerable to universal adversarial perturbation (UAP)
In this paper, we examine the serious dilemma of UAP generation methods from a generalization perspective.
We propose a simple and effective method called Gradient Aggregation (SGA)
SGA alleviates the gradient vanishing and escapes from poor local optima at the same time.
arXiv Detail & Related papers (2023-08-11T08:44:58Z) - Training Discrete Deep Generative Models via Gapped Straight-Through
Estimator [72.71398034617607]
We propose a Gapped Straight-Through ( GST) estimator to reduce the variance without incurring resampling overhead.
This estimator is inspired by the essential properties of Straight-Through Gumbel-Softmax.
Experiments demonstrate that the proposed GST estimator enjoys better performance compared to strong baselines on two discrete deep generative modeling tasks.
arXiv Detail & Related papers (2022-06-15T01:46:05Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Biased Stochastic First-Order Methods for Conditional Stochastic Optimization and Applications in Meta Learning [24.12941820827126]
We propose a biased gradient descent (BSGD) for Conditional optimization problems.
Our lower bound analysis shows that BSGD cannot be improved for general convex objectives non objectives.
For this special setting, we propose an accelerated algorithm called biased SpiderBoost (BSpiderBoost) that matches the lower bound.
arXiv Detail & Related papers (2020-02-25T10:57:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.