DORB: Dynamically Optimizing Multiple Rewards with Bandits
- URL: http://arxiv.org/abs/2011.07635v1
- Date: Sun, 15 Nov 2020 21:57:47 GMT
- Title: DORB: Dynamically Optimizing Multiple Rewards with Bandits
- Authors: Ramakanth Pasunuru, Han Guo, Mohit Bansal
- Abstract summary: Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks.
We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit).
We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
- Score: 101.68525259222164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradients-based reinforcement learning has proven to be a promising
approach for directly optimizing non-differentiable evaluation metrics for
language generation tasks. However, optimizing for a specific metric reward
leads to improvements in mostly that metric only, suggesting that the model is
gaming the formulation of that metric in a particular way, often without
achieving real qualitative improvements. Hence, it is more beneficial to make
the model optimize multiple diverse metric rewards jointly. While appealing,
this is challenging because one needs to manually decide the importance and
scaling weights of these metric rewards. Further, it is important to consider
using a dynamic combination and curriculum of metric rewards that flexibly
changes over time. Considering the above aspects, in our work, we automate the
optimization of multiple metric rewards simultaneously via a multi-armed bandit
approach (DORB), where at each round, the bandit chooses which metric reward to
optimize next, based on expected arm gains. We use the Exp3 algorithm for
bandits and formulate two approaches for bandit rewards: (1) Single
Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit
(HM-Bandit). We empirically show the effectiveness of our approaches via
various automatic metrics and human evaluation on two important NLG tasks:
question generation and data-to-text generation, including on an unseen-test
transfer setup. Finally, we present interpretable analyses of the learned
bandit curriculum over the optimized rewards.
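As a rough sketch of the bandit loop described in the abstract, the snippet below runs vanilla Exp3 over a few metric arms: at each round the bandit samples which metric reward to optimize next and is then updated with an importance-weighted estimate of the observed gain. The arm count, hyperparameters, and the simulated `validation_gain` feedback are illustrative assumptions; the paper's SM-Bandit and HM-Bandit define the bandit reward from validation-metric improvements, which is only mimicked here with random numbers.

```python
import math
import random

class Exp3:
    """Vanilla Exp3 bandit; each arm corresponds to one metric reward."""

    def __init__(self, n_arms: int, gamma: float = 0.1):
        self.n_arms = n_arms
        self.gamma = gamma                      # exploration rate
        self.weights = [1.0] * n_arms

    def probs(self):
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select_arm(self) -> int:
        return random.choices(range(self.n_arms), weights=self.probs(), k=1)[0]

    def update(self, arm: int, reward: float):
        # Importance-weighted reward estimate; `reward` is assumed to lie in [0, 1].
        x_hat = reward / self.probs()[arm]
        self.weights[arm] *= math.exp(self.gamma * x_hat / self.n_arms)


def validation_gain(arm: int) -> float:
    """Stand-in for the bandit reward (e.g., improvement of a validation metric
    after a few policy-gradient updates on the chosen metric). Simulated here."""
    return random.random() * (0.3 + 0.2 * arm)   # arm 2 pays off most in this toy setup


bandit = Exp3(n_arms=3)                          # e.g., three different generation metrics
for _ in range(500):
    arm = bandit.select_arm()                    # which metric reward to optimize next
    bandit.update(arm, validation_gain(arm))     # feed the observed gain back to the bandit
print([round(p, 2) for p in bandit.probs()])     # the highest-paying arm typically ends up largest
```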
Related papers
- Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
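As a toy illustration of decoding-time combination (not MOD's actual combination rule or guarantees), the sketch below mixes next-token distributions from several objective-specific predictors with user-chosen weights before sampling; the tiny vocabulary and the predictor functions are invented for the example.

```python
import numpy as np

def combined_next_token(logit_fns, weights, context):
    """Toy decoding-time mixing: weight each objective's next-token distribution,
    sum them, and sample. `logit_fns` are placeholder per-objective predictors."""
    mixture = None
    for fn, w in zip(logit_fns, weights):
        logits = fn(context)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                     # softmax for this objective
        mixture = w * probs if mixture is None else mixture + w * probs
    mixture /= mixture.sum()                     # guard against rounding drift
    return int(np.random.choice(len(mixture), p=mixture))

# Invented per-objective predictors over a 5-token vocabulary.
helpfulness = lambda ctx: np.array([2.0, 0.5, 0.1, 0.0, -1.0])
conciseness = lambda ctx: np.array([0.0, 1.5, 2.5, 0.2, -0.5])

token = combined_next_token([helpfulness, conciseness], weights=[0.7, 0.3], context=[])
```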
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation [21.983823344984483]
We study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation.
We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously.
arXiv Detail & Related papers (2024-03-20T13:24:41Z) - MORL-Prompt: An Empirical Analysis of Multi-Objective Reinforcement Learning for Discrete Prompt Optimization [45.410121761165634]
RL-based techniques can be employed to search for prompts that, when fed into a target language model, maximize a set of user-specified reward functions.
Current techniques focus on maximizing the average of reward functions, which does not necessarily lead to prompts that achieve balance across rewards.
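A toy example (numbers invented) makes that point concrete: averaging can prefer a prompt that neglects one reward, while a min-based scalarization prefers the balanced candidate.

```python
# Two candidate prompts scored on two rewards (invented numbers).
candidates = {
    "prompt_a": (0.95, 0.25),   # strong on reward 1, weak on reward 2
    "prompt_b": (0.60, 0.55),   # balanced across both rewards
}

best_by_mean = max(candidates, key=lambda k: sum(candidates[k]) / 2)   # prompt_a (avg 0.600)
best_by_min = max(candidates, key=lambda k: min(candidates[k]))        # prompt_b (min 0.55)
print(best_by_mean, best_by_min)
```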
arXiv Detail & Related papers (2024-02-18T21:25:09Z) - MultiSlot ReRanker: A Generic Model-based Re-Ranking Framework in Recommendation Systems [6.0232112783722]
We propose a generic model-based re-ranking framework, MultiSlot ReRanker, which simultaneously optimizes relevance, diversity, and freshness.
We have built a multi-slot re-ranking simulator based on OpenAI Gym integrated with the Ray framework.
arXiv Detail & Related papers (2024-01-11T23:17:07Z) - A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it does not require training a reward model or unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards [101.7246658985579]
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data.
We propose embracing the heterogeneity of diverse rewards by following a multi-policy strategy.
We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks.
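The central operation, interpolating the weights of models fine-tuned on different rewards, can be sketched as below; the toy architecture and the interpolation coefficient are assumptions, and sweeping the coefficient then traces different trade-offs between the two rewards.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, lam: float):
    """Return lam * sd_a + (1 - lam) * sd_b, parameter by parameter."""
    return {k: lam * sd_a[k] + (1.0 - lam) * sd_b[k] for k in sd_a}

# Toy stand-ins for two copies of one architecture fine-tuned on different rewards.
model_a = torch.nn.Linear(8, 2)      # imagine: fine-tuned on reward 1
model_b = torch.nn.Linear(8, 2)      # imagine: fine-tuned on reward 2

soup = torch.nn.Linear(8, 2)
soup.load_state_dict(interpolate_state_dicts(model_a.state_dict(),
                                              model_b.state_dict(), lam=0.5))
```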
arXiv Detail & Related papers (2023-06-07T14:58:15Z) - Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - Choosing the Best of Both Worlds: Diverse and Novel Recommendations through Multi-Objective Reinforcement Learning [68.45370492516531]
We introduce Scalarized Multi-Objective Reinforcement Learning (SMORL) for the Recommender Systems (RS) setting.
The SMORL agent augments standard recommendation models with additional RL layers that push them to simultaneously satisfy three principal objectives: accuracy, diversity, and novelty of recommendations.
Our experimental results on two real-world datasets reveal a substantial increase in aggregate diversity, a moderate increase in accuracy, reduced repetitiveness of recommendations, and demonstrate the importance of reinforcing diversity and novelty as complementary objectives.
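The scalarization itself is just a weighted combination of the per-objective signals; the sketch below uses invented reward values and weights purely for illustration, not the SMORL implementation.

```python
def scalarized_reward(rewards, weights):
    """Collapse per-objective rewards (accuracy, diversity, novelty) into one scalar."""
    return sum(w * r for w, r in zip(weights, rewards))

# Invented per-recommendation signals in [0, 1] and objective weights.
reward = scalarized_reward(rewards=(0.8, 0.4, 0.6),     # accuracy, diversity, novelty
                           weights=(0.6, 0.2, 0.2))     # relative importance
print(reward)                                           # 0.68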
arXiv Detail & Related papers (2021-10-28T13:22:45Z) - Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
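The alternation can be summarized by the schematic loop below; both update steps are placeholders, and the actual self-supervised reward-inference objective is not reproduced here.

```python
def update_policy(policy, shaped_reward_fn, buffer):
    """Placeholder: run some RL updates with the current shaped reward."""
    return policy

def infer_shaped_reward(policy, buffer):
    """Placeholder: fit a denser reward from the agent's own experience."""
    return lambda state, action: 0.0

policy, buffer = object(), []
shaped_reward = lambda state, action: 0.0        # start from the sparse, uninformative reward
for _ in range(10):                              # alternate the two phases
    policy = update_policy(policy, shaped_reward, buffer)
    shaped_reward = infer_shaped_reward(policy, buffer)
```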
arXiv Detail & Related papers (2021-03-08T03:28:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.