Related papers: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization

URL: http://arxiv.org/abs/2306.09222v5
Date: Sun, 13 Oct 2024 04:07:11 GMT
Title: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization
Authors: Ramnath Kumar, Kushal Majmundar, Dheeraj Nagaraj, Arun Sai Suggala
Abstract summary: We present Re-weighted Gradient Descent (RGD), a novel optimization technique that improves the performance of deep neural networks through dynamic sample re-weighting. We demonstrate the effectiveness of RGD on various learning tasks, including supervised learning, meta-learning, and out-of-domain generalization.
Score: 14.23697277904244
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Re-weighted Gradient Descent (RGD), a novel optimization technique that improves the performance of deep neural networks through dynamic sample re-weighting. Leveraging insights from distributionally robust optimization (DRO) with Kullback-Leibler divergence, our method dynamically assigns importance weights to training data during each optimization step. RGD is simple to implement, computationally efficient, and compatible with widely used optimizers such as SGD and Adam. We demonstrate the effectiveness of RGD on various learning tasks, including supervised learning, meta-learning, and out-of-domain generalization. Notably, RGD achieves state-of-the-art results on diverse benchmarks, with improvements of +0.7% on DomainBed, +1.44% on tabular classification, +1.94% on GLUE with BERT, and +1.01% on ImageNet-1K with ViT.
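
As a rough illustration of the re-weighting idea described in the abstract, the sketch below computes per-sample weights by exponentially tilting the per-sample losses, which is the standard closed form for the worst-case distribution under a KL-divergence DRO objective. It is a minimal sketch, not the authors' implementation: the function names `kl_dro_weights` and `reweighted_loss`, the `temperature` and `clip` parameters, and the normalization of weights to sum to one are all assumptions made for this example.

```python
import math


def kl_dro_weights(losses, temperature=1.0, clip=10.0):
    """Return per-sample weights proportional to exp(loss / temperature).

    Hypothetical helper: `temperature` and `clip` are illustrative
    hyperparameters, not values taken from the paper.
    """
    # Clip large losses so a single outlier cannot dominate the batch.
    scaled = [min(loss, clip) / temperature for loss in losses]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    # Normalize so the weights form a probability distribution.
    return [e / total for e in exps]


def reweighted_loss(losses, temperature=1.0, clip=10.0):
    """Weighted batch loss; harder samples contribute more."""
    # In a real training loop the weights would be treated as constants
    # (no gradient flows through them) before calling the optimizer.
    weights = kl_dro_weights(losses, temperature, clip)
    return sum(w * loss for w, loss in zip(weights, losses))


if __name__ == "__main__":
    batch_losses = [0.1, 0.5, 2.0]
    print(kl_dro_weights(batch_losses))
    print(reweighted_loss(batch_losses))
```

Because the weights depend only on the per-sample losses, a scheme of this kind can sit in front of any optimizer such as SGD or Adam: the weighted loss simply replaces the plain batch mean.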

Related papers

EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration [17.190325630307097]
We propose an Exponential-Decay Free-Form Deformation Network (EDFFDNet), which employs free-form deformation with an exponential-decay basis function. By transforming dense interactions into sparse ones, ASMA reduces parameters and improves accuracy. Experiments demonstrate that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively. EDFFDNet-2 further improves PSNR by 1.06 dB while maintaining lower computational costs.
arXiv Detail & Related papers (2025-09-09T12:30:51Z)
Aligning Distributionally Robust Optimization with Practical Deep Learning Needs [70.87757502315293]
While traditional Deep Learning (DL) methods treat all samples equally, a significant gap exists between DRO and current DL practices. This paper aims to bridge the gap by introducing an adaptive algorithm for a modified DRO objective that can handle weight assignment groups.
arXiv Detail & Related papers (2025-08-22T18:17:44Z)
Flow-GRPO: Training Flow Matching Models via Online RL [75.70017261794422]
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number.
arXiv Detail & Related papers (2025-05-08T17:58:45Z)
MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to the development of this type of training. We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
Adaptive Friction in Deep Learning: Enhancing Optimizers with Sigmoid and Tanh Function [0.0]
We introduce sigSignGrad and tanhSignGrad, two novel optimizers that integrate adaptive friction coefficients. Our theoretical analysis demonstrates the wide-ranging adjustment capability of the friction coefficient S. Experiments on CIFAR-10 and Mini-ImageNet using ResNet50 and ViT architectures confirm the superior performance of our proposed optimizers.
arXiv Detail & Related papers (2024-08-07T03:20:46Z)
Variational Stochastic Gradient Descent for Deep Neural Networks [16.96187187108041]
Current state-of-the-art optimizers are adaptive gradient-based optimization methods such as Adam. Here, we propose to combine both approaches, resulting in Variational Stochastic Gradient Descent (VSGD). We show how our VSGD method relates to other adaptive gradient-based optimizers like Adam.
arXiv Detail & Related papers (2024-04-09T18:02:01Z)
Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms [13.134564730161983]
This paper adopts a novel approach to deep learning optimization, focusing on stochastic gradient descent (SGD) and its variants. We show that SGD and its variants demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD and noise-enabled variants.
arXiv Detail & Related papers (2024-03-01T14:55:22Z)
Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural Networks [48.089501687522954]
We propose a novel layer-adaptive weight-pruning approach for Deep Neural Networks (DNNs). Our approach takes into account the collective influence of all layers to design a layer-adaptive pruning scheme. Our experiments demonstrate the superiority of our approach over existing methods on the ImageNet and CIFAR-10 datasets.
arXiv Detail & Related papers (2023-08-21T03:22:47Z)
Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems. However, PINNs are prone to training failures when the target functions to be approximated exhibit high-frequency or multi-scale features. In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and Stability [67.8426046908398]
Generalizability and stability are two key objectives for operating reinforcement learning (RL) agents in the real world. This paper presents MetaPG, an evolutionary method for automated design of actor-critic loss functions.
arXiv Detail & Related papers (2022-04-08T20:46:16Z)
ZARTS: On Zero-order Optimization for Neural Architecture Search [94.41017048659664]
Differentiable architecture search (DARTS) has been a popular one-shot paradigm for NAS due to its high efficiency. This work turns to zero-order optimization and proposes a novel NAS scheme, called ZARTS, to search without enforcing the above approximation. In particular, results on 12 benchmarks verify the outstanding robustness of ZARTS, where the performance of DARTS collapses due to its known instability issue.
arXiv Detail & Related papers (2021-10-10T09:35:15Z)
Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem). AdaRem adjusts the parameter-wise learning rate according to whether the direction of a parameter's past changes is aligned with the direction of the current gradient. Our method outperforms previous adaptive learning rate-based algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
Generalized Reinforcement Meta Learning for Few-Shot Optimization [3.7675996866306845]
We present a generic and flexible Reinforcement Learning (RL) based meta-learning framework for the problem of few-shot learning. Our framework could be easily extended to do network architecture search.
arXiv Detail & Related papers (2020-05-04T03:21:05Z)
Weighted Aggregating Stochastic Gradient Descent for Parallel Deep Learning [8.366415386275557]
The solution involves a reformulation of the objective function for optimization in neural network models. We introduce a decentralized weighted aggregating scheme based on the performance of local workers. To validate the new method, we benchmark our schemes against several popular algorithms.
arXiv Detail & Related papers (2020-04-07T23:38:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.