Related papers: Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

URL: http://arxiv.org/abs/2509.11452v1
Date: Sun, 14 Sep 2025 21:56:35 GMT
Title: Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting
Authors: Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang,
Abstract summary: Prior works in multi-reward learning typically use linear scalarization with fixed weights, which fail to capture effective online learning.<n>We introduce two approaches to increasing objective alignment, one for online learning, the other for space exploration.
Score: 48.87957020168614
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prior works in multi-objective reinforcement learning typically use linear reward scalarization with fixed weights, which provably fail to capture non-convex Pareto fronts and thus yield suboptimal results. This limitation becomes especially critical in online preference alignment for large language models. Here, stochastic trajectories generated by parameterized policies create highly non-linear and non-convex mappings from parameters to objectives that no single static weighting scheme can find optimal trade-offs. We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives in training, facilitating effective exploration of Pareto fronts in objective space. We introduce two approaches of increasing sophistication and generalizability: (1) hypervolume-guided weight adaptation and (2) gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms (including GRPO, REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning datasets, and applicability to different model families, consistently achieving Pareto dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.

Related papers

MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs)<n>We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck.<n>We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing textitpolicy evaluation as learning a generative model of the joint distribution over trajectories and returns.<n>Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z)
Learning an Efficient Optimizer via Hybrid-Policy Sub-Trajectory Balance [42.630489353592786]
Recent advances in generative modeling enable neural networks to generate weights without relying on gradient-based optimization.<n>Lo-Hp is a decoupled two-stage weight generation framework that enhances flexibility through learning various optimization policies.<n>We demonstrate that learning solely local optimization policies can address the long-horizon issue while enhancing the generation of global optimal weights.
arXiv Detail & Related papers (2025-11-01T13:08:28Z)
ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning [17.928214942495412]
ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase.<n>We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro.<n>Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
arXiv Detail & Related papers (2025-10-01T09:11:27Z)
Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment [5.276657230880984]
Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences.<n>Direct Optimization Preference (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs.<n>We propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback.<n>Our method consistently outperforms standard DPO on alignment while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
arXiv Detail & Related papers (2025-06-24T16:47:17Z)
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.<n>We introduce novel algorithms for dynamic, instance-level data reweighting.<n>Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost. We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion. By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
arXiv Detail & Related papers (2024-06-14T07:16:18Z)
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment [46.44464839353993]
We introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context. RiC only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time.
arXiv Detail & Related papers (2024-02-15T18:58:31Z)
Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task. The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure the task relevance. We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
arXiv Detail & Related papers (2023-06-06T02:24:41Z)
Supervised Contrastive Learning as Multi-Objective Optimization for Fine-Tuning Large Pre-trained Language Models [3.759936323189417]
Supervised Contrastive Learning (SCL) has been shown to achieve excellent performance in most classification tasks. In this work, we formulate the SCL problem as a Multi-Objective Optimization problem for the fine-tuning phase of RoBERTa language model.
arXiv Detail & Related papers (2022-09-28T15:13:58Z)
PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm [0.18416014644193063]
We propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
arXiv Detail & Related papers (2022-08-16T19:23:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.