AdaTerm: Adaptive T-Distribution Estimated Robust Moments for
Noise-Robust Stochastic Gradient Optimization
- URL: http://arxiv.org/abs/2201.06714v4
- Date: Tue, 29 Aug 2023 06:41:58 GMT
- Title: AdaTerm: Adaptive T-Distribution Estimated Robust Moments for
Noise-Robust Stochastic Gradient Optimization
- Authors: Wendyam Eric Lionel Ilboudo, Taisuke Kobayashi and Takamitsu Matsubara
- Abstract summary: We propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all associated statistics.
This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time.
- Score: 14.531550983885772
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the increasing practicality of deep learning applications, practitioners
are inevitably faced with datasets corrupted by noise from various sources such
as measurement errors, mislabeling, and estimated surrogate inputs/outputs that
can adversely impact the optimization results. It is a common practice to
improve the optimization algorithm's robustness to noise, since this algorithm
is ultimately in charge of updating the network parameters. Previous studies
revealed that the first-order moment used in Adam-like stochastic gradient
descent optimizers can be modified based on the Student's t-distribution. While
this modification led to noise-resistant updates, the other associated
statistics remained unchanged, resulting in inconsistencies in the assumed
models. In this paper, we propose AdaTerm, a novel approach that incorporates
the Student's t-distribution to derive not only the first-order moment but also
all the associated statistics. This provides a unified treatment of the
optimization process, offering a comprehensive framework under the statistical
model of the t-distribution for the first time. The proposed approach offers
several advantages over previously proposed approaches, including reduced
hyperparameters and improved robustness and adaptability. This noise-adaptive
behavior contributes to AdaTerm's exceptional learning performance, as
demonstrated through various optimization problems with different and/or
unknown noise ratios. Furthermore, we introduce a new technique for deriving a
theoretical regret bound without relying on AMSGrad, providing a valuable
contribution to the field.
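As a rough illustration of the idea described in the abstract, the sketch below implements a simplified Adam-like optimizer whose moments are updated with a Student's-t-inspired weight, so that gradients far from the current moment estimates are automatically down-weighted. The class name, the fixed degrees-of-freedom parameter `nu`, and the specific interpolation rule are illustrative assumptions; the actual AdaTerm update (which also adapts the degrees of freedom and derives all statistics from a single t-distribution model) is given in the paper.

```python
import numpy as np

class AdaTermLikeOptimizer:
    """Simplified sketch of a Student's-t-based robust-moment optimizer.
    Not the exact AdaTerm update rules; nu is fixed here instead of adapted."""

    def __init__(self, params, lr=1e-3, beta=0.9, nu=5.0, eps=1e-8):
        self.params = params              # 1-D float array, updated in place
        self.lr, self.beta, self.nu, self.eps = lr, beta, nu, eps
        self.m = np.zeros_like(params)    # robust first moment (mean of gradients)
        self.v = np.ones_like(params)     # robust scale (second moment)

    def step(self, grad):
        d = grad.size
        # Squared Mahalanobis-like distance of the new gradient from the current moments
        dist = np.sum((grad - self.m) ** 2 / (self.v + self.eps))
        # Student's-t weight: outlier gradients (large dist) get w << 1
        w = (self.nu + d) / (self.nu + dist)
        # Effective EMA step; for w = 1 this reduces to the usual (1 - beta) smoothing
        k = (1.0 - self.beta) * w / ((1.0 - self.beta) * w + self.beta)
        diff = grad - self.m
        self.m = self.m + k * diff
        self.v = (1.0 - k) * self.v + k * diff ** 2
        self.params -= self.lr * self.m / (np.sqrt(self.v) + self.eps)
        return self.params


# Toy usage: minimize ||theta||^2 when 10% of gradients are heavy-tailed outliers
rng = np.random.default_rng(0)
theta = np.array([2.0, -1.0])
opt = AdaTermLikeOptimizer(theta, lr=0.1)
for _ in range(3000):
    grad = 2.0 * theta
    if rng.random() < 0.1:
        grad = grad + 20.0 * rng.standard_t(df=1.0, size=theta.shape)
    theta = opt.step(grad)
print(theta)  # should end up close to [0, 0] despite the corrupted gradients
```

Because the weight w collapses toward zero for outlier gradients, a corrupted mini-batch barely moves the moment estimates, which is the noise-robust, noise-adaptive behavior the abstract refers to.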
Related papers
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
- Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms [13.134564730161983]
This paper adopts a novel approach to deep learning optimization, focusing on stochastic gradient descent (SGD) and its variants.
We show that SGD and its variants demonstrate performance on par with flat-minima optimizers such as SAM, albeit with half the gradient evaluations.
Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD and noise-enabled variants.
arXiv Detail & Related papers (2024-03-01T14:55:22Z)
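The comparison above involves sharpness-aware minimization (SAM), which needs two gradient evaluations per update (one to find the worst-case perturbation, one to descend); this is why plain SGD gets by with "half the gradient evaluations". Below is a generic sketch of the two updates on a toy quadratic, not tied to this paper's experiments.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) step: two gradient evaluations."""
    g = grad_fn(w)                               # 1st evaluation: ascent direction
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # perturb toward higher loss
    g_sharp = grad_fn(w + eps)                   # 2nd evaluation: gradient at the perturbed point
    return w - lr * g_sharp                      # descend using the sharpness-aware gradient

def sgd_step(w, grad_fn, lr=0.1):
    """Plain SGD: a single gradient evaluation per step."""
    return w - lr * grad_fn(w)

# Toy quadratic loss L(w) = ||w||^2 with gradient 2w
grad_fn = lambda w: 2.0 * w
w_sam = w_sgd = np.array([3.0, -2.0])
for _ in range(100):
    w_sam = sam_step(w_sam, grad_fn)
    w_sgd = sgd_step(w_sgd, grad_fn)
print(w_sam, w_sgd)  # both converge; SAM used twice as many gradient calls
```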
- End-to-End Learning for Fair Multiobjective Optimization Under Uncertainty [55.04219793298687]
The Predict-Then-Optimize (PtO) paradigm in machine learning aims to maximize downstream decision quality.
This paper extends the PtO methodology to optimization problems with nondifferentiable Ordered Weighted Averaging (OWA) objectives.
It shows how optimization of OWA functions can be effectively integrated with parametric prediction for fair and robust optimization under uncertainty.
arXiv Detail & Related papers (2024-02-12T16:33:35Z)
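The fair-optimization entry above builds on Ordered Weighted Averaging (OWA) objectives. An OWA aggregates a vector of outcomes by first sorting it and then taking a weighted sum, so the weights act on ranks rather than on fixed components; with decreasing weights the worst-off components dominate, which is what makes the objective fairness-oriented and nondifferentiable at ties. A minimal sketch of the aggregation alone (the end-to-end learning pipeline is not reproduced here):

```python
import numpy as np

def owa(values, weights):
    """Ordered Weighted Averaging: weights are applied by rank, not by index.
    With decreasing weights, the worst-off (largest-cost) components dominate."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert values.shape == weights.shape
    ordered = np.sort(values)[::-1]          # largest cost first
    return float(np.dot(weights, ordered))

# Example: three groups' costs with fairness-oriented decreasing weights
print(owa([4.0, 1.0, 2.0], [0.5, 0.3, 0.2]))  # 0.5*4 + 0.3*2 + 0.2*1 = 2.8
```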
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Efficient and Differentiable Conformal Prediction with General Function Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximate valid population coverage and near-optimal efficiency within class.
Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z)
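For background on the conformal-prediction entry above: standard split conformal prediction turns any point predictor into prediction sets with finite-sample marginal coverage, and the paper generalizes this construction to multiple learnable parameters. The sketch below shows only the standard split-conformal interval for regression; the linear "model" and synthetic calibration data are placeholders, not the paper's method.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, x_new, alpha=0.1):
    """Standard split conformal prediction for regression:
    returns an interval with approximately (1 - alpha) marginal coverage."""
    scores = np.abs(y_cal - predict(X_cal))           # nonconformity scores on held-out data
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))           # finite-sample corrected rank
    q = np.sort(scores)[min(k, n) - 1]                # conformal quantile
    y_hat = predict(np.atleast_1d(x_new))[0]
    return y_hat - q, y_hat + q

# Toy example: a fixed linear "model" and synthetic calibration data
rng = np.random.default_rng(0)
predict = lambda X: 2.0 * X                           # pretend this was fit on separate data
X_cal = rng.uniform(0, 1, 200)
y_cal = 2.0 * X_cal + rng.normal(0, 0.1, 200)
print(split_conformal_interval(predict, X_cal, y_cal, 0.5))  # roughly (0.84, 1.16)
```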
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
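The SGLD entry above states a concrete prescription: under a constraint keeping empirical risk low, the bound-optimal injected-noise covariance equals the square root of the expected gradient covariance. The sketch below applies a diagonal version of that idea in a generic SGLD-style step, estimating the gradient second moment with an exponential moving average; the diagonal approximation and the learning-rate scaling of the noise are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def sgld_anisotropic_step(theta, grad, grad_sq_ema, lr=1e-2, beta=0.99, eps=1e-12, rng=None):
    """One SGLD-style step with anisotropic injected Gaussian noise.
    The diagonal noise covariance is set to sqrt(E[g^2]) elementwise,
    i.e. the square root of a diagonal estimate of the gradient covariance."""
    rng = rng or np.random.default_rng()
    grad_sq_ema = beta * grad_sq_ema + (1 - beta) * grad ** 2   # running E[g^2] estimate
    noise_cov_diag = np.sqrt(grad_sq_ema) + eps                 # Sigma = sqrt(gradient covariance)
    noise = rng.normal(size=theta.shape) * np.sqrt(lr * noise_cov_diag)
    theta = theta - lr * grad + noise
    return theta, grad_sq_ema

# Toy usage on a quadratic loss ||theta||^2 / 2 (gradient = theta)
rng = np.random.default_rng(0)
theta = np.array([2.0, -1.0])
g2 = np.zeros_like(theta)
for _ in range(1000):
    theta, g2 = sgld_anisotropic_step(theta, grad=theta, grad_sq_ema=g2, rng=rng)
print(theta)  # hovers near the optimum with small anisotropic fluctuations
```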
- Recursive Inference for Variational Autoencoders [34.552283758419506]
Inference networks of traditional Variational Autoencoders (VAEs) are typically amortized.
Recent semi-amortized approaches were proposed to address this drawback.
We introduce an accurate amortized inference algorithm.
arXiv Detail & Related papers (2020-11-17T10:22:12Z)
- Real-Time Optimization Meets Bayesian Optimization and Derivative-Free Optimization: A Tale of Modifier Adaptation [0.0]
This paper investigates a new class of modifier-adaptation schemes to overcome plant-model mismatch in real-time optimization of uncertain processes.
The proposed schemes embed a physical model and rely on trust-region ideas to minimize risk during the exploration.
The benefits of using an acquisition function, knowing the process noise level, or specifying a nominal process model are illustrated.
arXiv Detail & Related papers (2020-09-18T12:57:17Z)
- Beyond variance reduction: Understanding the true impact of baselines on policy optimization [24.09670734037029]
We show that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates.
We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics.
arXiv Detail & Related papers (2020-08-31T17:52:09Z)
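The entry above studies how baselines shape policy-gradient learning dynamics beyond their variance-reduction role, with bandit problems as the test bed. Below is a minimal REINFORCE-with-baseline sketch on a Bernoulli bandit; the running-average baseline is just one common choice and is not meant to reproduce the paper's analysis.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # 3-armed Bernoulli bandit
logits = np.zeros(3)                      # policy parameters (softmax policy)
baseline, lr, beta = 0.0, 0.5, 0.9

for _ in range(3000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)
    r = float(rng.random() < true_means[a])          # Bernoulli reward
    grad_log_pi = -probs                             # grad of log pi(a): onehot(a) - probs
    grad_log_pi[a] += 1.0
    logits += lr * (r - baseline) * grad_log_pi      # baseline shifts the effective step
    baseline = beta * baseline + (1 - beta) * r      # running-average baseline

print(softmax(logits))  # probability mass should concentrate on the best arm (index 2)
```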
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.