Aligning Distributionally Robust Optimization with Practical Deep Learning Needs
- URL: http://arxiv.org/abs/2508.16734v2
- Date: Thu, 25 Sep 2025 15:03:41 GMT
- Title: Aligning Distributionally Robust Optimization with Practical Deep Learning Needs
- Authors: Dmitrii Feoktistov, Igor Ignashin, Andrey Veprikov, Nikita Borovko, Alexander Bogdanov, Savelii Chezhegov, Aleksandr Beznosikov
- Abstract summary: While traditional Deep Learning (DL) optimization methods treat all samples equally, Distributionally Robust Optimization (DRO) assigns importance weights adaptively, and a significant gap exists between DRO and current DL practices. This paper aims to bridge the gap by introducing an adaptive algorithm for a modified DRO objective that can handle weight assignment to sample groups.
- Score: 70.87757502315293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While traditional Deep Learning (DL) optimization methods treat all training samples equally, Distributionally Robust Optimization (DRO) adaptively assigns importance weights to different samples. However, a significant gap exists between DRO and current DL practices. Modern DL optimizers require adaptivity and the ability to handle stochastic gradients, as these methods demonstrate superior performance. Additionally, for practical applications, a method should allow weight assignment not only to individual samples, but also to groups of objects (for example, all samples of the same class). This paper aims to bridge this gap by introducing ALSO (Adaptive Loss Scaling Optimizer), an adaptive algorithm for a modified DRO objective that can handle weight assignment to sample groups. We prove the convergence of our proposed algorithm for non-convex objectives, which is the typical case for DL models. Empirical evaluation across diverse Deep Learning tasks, from Tabular DL to Split Learning tasks, demonstrates that ALSO outperforms both traditional optimizers and existing DRO methods.
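The idea of assigning weights to groups of samples can be illustrated with a minimal sketch. This is not the ALSO algorithm from the paper. It assumes a KL-regularized DRO objective over groups, for which the worst-case group weights have a closed form: a softmax over the group losses. The function name `group_dro_weights`, the temperature `tau`, and the use of the mean loss per group are assumptions made for illustration only.

```python
import numpy as np

def group_dro_weights(losses, groups, num_groups, tau=1.0):
    """Return (group weights, weighted loss) for a KL-regularized group DRO objective.

    Hypothetical helper: assumes every group id in [0, num_groups) occurs at
    least once in `groups`, and that `tau` > 0.
    """
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups, dtype=int)

    # Mean loss per group (e.g. one group per class).
    sums = np.bincount(groups, weights=losses, minlength=num_groups)
    counts = np.bincount(groups, minlength=num_groups)
    group_loss = sums / np.maximum(counts, 1)

    # Maximizer of sum_g pi_g * L_g - tau * KL(pi || uniform) over the simplex:
    # pi_g is proportional to exp(L_g / tau). Subtract the max for stability.
    logits = group_loss / tau
    logits -= logits.max()
    pi = np.exp(logits)
    pi /= pi.sum()

    return pi, float(pi @ group_loss)


# Two groups with mean losses 1.0 and 3.0: the harder group gets more weight.
pi, robust_loss = group_dro_weights([1.0, 1.0, 3.0, 3.0], [0, 0, 1, 1], num_groups=2)
```

A small `tau` concentrates the weight on the worst group, while a large `tau` approaches uniform weighting, which recovers the ordinary average loss.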
Related papers
- Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW [2.028622227373579]
Stochastic gradient-based descent (SGD) methods have long been central to training large language models (LLMs).
This paper proposes a conjugate subgradient method together with adaptive sampling specifically for training LLMs.
arXiv Detail & Related papers (2025-07-01T23:30:15Z) - Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment [5.276657230880984]
Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences.
Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs.
We propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback.
Our method consistently outperforms standard DPO on alignment while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
arXiv Detail & Related papers (2025-06-24T16:47:17Z) - Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization [37.54165341391688]
We introduce a novel problem: Sample Scheduling for DPO.
We propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch.
This work points to a promising new direction for improving LLM alignment through batch-wise sample selection.
arXiv Detail & Related papers (2025-06-08T10:26:09Z) - Taming LLMs by Scaling Learning Rates with Gradient Grouping [49.91587150497186]
Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures.
This work introduces Scaling with Gradient Grouping (SGG), a gradient wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling.
arXiv Detail & Related papers (2025-06-01T15:30:37Z) - Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval [32.104911827710936]
We propose a new task-level Distributionally Robust Optimization (tDRO) algorithm for Large Language Model-based Dense Retrieval fine-tuning.
The tDRO algorithm parameterizes the domain weights and updates them with scaled domain gradients.
Experiments show optimal improvements on large-scale retrieval benchmarks and a reduction in dataset usage of up to 30%.
arXiv Detail & Related papers (2024-08-20T07:48:19Z) - DRAUC: An Instance-wise Distributionally Robust AUC Optimization Framework [133.26230331320963]
Area Under the ROC Curve (AUC) is a widely employed metric in long-tailed classification scenarios.
We propose an instance-wise surrogate loss of Distributionally Robust AUC (DRAUC) and build our optimization framework on top of it.
arXiv Detail & Related papers (2023-11-06T12:15:57Z) - Learning Distributionally Robust Models at Scale via Composite Optimization [45.47760229170775]
We show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods.
We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.
arXiv Detail & Related papers (2022-03-17T20:47:42Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - An Online Method for A Class of Distributionally Robust Optimization with Non-Convex Objectives [54.29001037565384]
We propose a practical online method for solving a class of online distributionally robust optimization (DRO) problems.
Our studies demonstrate important applications in machine learning for improving the robustness of networks.
arXiv Detail & Related papers (2020-06-17T20:19:25Z) - Adaptive Learning of the Optimal Batch Size of SGD [52.50880550357175]
We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling suitable for distributed implementations.
arXiv Detail & Related papers (2020-05-03T14:28:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.