DORB: Dynamically Optimizing Multiple Rewards with Bandits
- URL: http://arxiv.org/abs/2011.07635v1
- Date: Sun, 15 Nov 2020 21:57:47 GMT
- Title: DORB: Dynamically Optimizing Multiple Rewards with Bandits
- Authors: Ramakanth Pasunuru, Han Guo, Mohit Bansal
- Abstract summary: Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks.
We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit).
We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
- Score: 101.68525259222164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradients-based reinforcement learning has proven to be a promising
approach for directly optimizing non-differentiable evaluation metrics for
language generation tasks. However, optimizing for a specific metric reward
leads to improvements in mostly that metric only, suggesting that the model is
gaming the formulation of that metric in a particular way, often without
achieving real qualitative improvements. Hence, it is more beneficial to make
the model optimize multiple diverse metric rewards jointly. While appealing,
this is challenging because one needs to manually decide the importance and
scaling weights of these metric rewards. Further, it is important to consider
using a dynamic combination and curriculum of metric rewards that flexibly
changes over time. Considering the above aspects, in our work, we automate the
optimization of multiple metric rewards simultaneously via a multi-armed bandit
approach (DORB), where at each round, the bandit chooses which metric reward to
optimize next, based on expected arm gains. We use the Exp3 algorithm for
bandits and formulate two approaches for bandit rewards: (1) Single
Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit
(HM-Bandit). We empirically show the effectiveness of our approaches via
various automatic metrics and human evaluation on two important NLG tasks:
question generation and data-to-text generation, including on an unseen-test
transfer setup. Finally, we present interpretable analyses of the learned
bandit curriculum over the optimized rewards.
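As a rough sketch of the bandit loop described in the abstract, the snippet below runs vanilla Exp3 over a few metric arms: at each round the bandit samples which metric reward to optimize next and is then updated with an importance-weighted estimate of the observed gain. The arm count, hyperparameters, and the simulated `validation_gain` feedback are illustrative assumptions; the paper's SM-Bandit and HM-Bandit define the bandit reward from validation-metric improvements, which is only mimicked here with random numbers.

```python
import math
import random

class Exp3:
    """Vanilla Exp3 bandit; each arm corresponds to one metric reward."""

    def __init__(self, n_arms: int, gamma: float = 0.1):
        self.n_arms = n_arms
        self.gamma = gamma                      # exploration rate
        self.weights = [1.0] * n_arms

    def probs(self):
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select_arm(self) -> int:
        return random.choices(range(self.n_arms), weights=self.probs(), k=1)[0]

    def update(self, arm: int, reward: float):
        # Importance-weighted reward estimate; `reward` is assumed to lie in [0, 1].
        x_hat = reward / self.probs()[arm]
        self.weights[arm] *= math.exp(self.gamma * x_hat / self.n_arms)


def validation_gain(arm: int) -> float:
    """Stand-in for the bandit reward (e.g., improvement of a validation metric
    after a few policy-gradient updates on the chosen metric). Simulated here."""
    return random.random() * (0.3 + 0.2 * arm)   # arm 2 pays off most in this toy setup


bandit = Exp3(n_arms=3)                          # e.g., three different generation metrics
for _ in range(500):
    arm = bandit.select_arm()                    # which metric reward to optimize next
    bandit.update(arm, validation_gain(arm))     # feed the observed gain back to the bandit
print([round(p, 2) for p in bandit.probs()])     # the highest-paying arm typically ends up largest
```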
Related papers
- Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
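As a toy illustration of decoding-time combination (not MOD's actual combination rule or guarantees), the sketch below mixes next-token distributions from several objective-specific predictors with user-chosen weights before sampling; the tiny vocabulary and the predictor functions are invented for the example.

```python
import numpy as np

def combined_next_token(logit_fns, weights, context):
    """Toy decoding-time mixing: weight each objective's next-token distribution,
    sum them, and sample. `logit_fns` are placeholder per-objective predictors."""
    mixture = None
    for fn, w in zip(logit_fns, weights):
        logits = fn(context)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                     # softmax for this objective
        mixture = w * probs if mixture is None else mixture + w * probs
    mixture /= mixture.sum()                     # guard against rounding drift
    return int(np.random.choice(len(mixture), p=mixture))

# Invented per-objective predictors over a 5-token vocabulary.
helpfulness = lambda ctx: np.array([2.0, 0.5, 0.1, 0.0, -1.0])
conciseness = lambda ctx: np.array([0.0, 1.5, 2.5, 0.2, -0.5])

token = combined_next_token([helpfulness, conciseness], weights=[0.7, 0.3], context=[])
```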
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation [21.983823344984483]
We study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation.
We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously.
arXiv Detail & Related papers (2024-03-20T13:24:41Z) - MORL-Prompt: An Empirical Analysis of Multi-Objective Reinforcement Learning for Discrete Prompt Optimization [45.410121761165634]
RL-based techniques can be employed to search for prompts that, when fed into a target language model, maximize a set of user-specified reward functions.
Current techniques focus on maximizing the average of reward functions, which does not necessarily lead to prompts that achieve balance across rewards.
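A toy example (numbers invented) makes that point concrete: averaging can prefer a prompt that neglects one reward, while a min-based scalarization prefers the balanced candidate.

```python
# Two candidate prompts scored on two rewards (invented numbers).
candidates = {
    "prompt_a": (0.95, 0.25),   # strong on reward 1, weak on reward 2
    "prompt_b": (0.60, 0.55),   # balanced across both rewards
}

best_by_mean = max(candidates, key=lambda k: sum(candidates[k]) / 2)   # prompt_a (avg 0.600)
best_by_min = max(candidates, key=lambda k: min(candidates[k]))        # prompt_b (min 0.55)
print(best_by_mean, best_by_min)
```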
arXiv Detail & Related papers (2024-02-18T21:25:09Z) - MultiSlot ReRanker: A Generic Model-based Re-Ranking Framework in Recommendation Systems [6.0232112783722]
We propose a generic model-based re-ranking framework, MultiSlot ReRanker, which simultaneously optimizes relevance, diversity, and freshness.
We have built a multi-slot re-ranking simulator based on OpenAI Gym integrated with the Ray framework.
arXiv Detail & Related papers (2024-01-11T23:17:07Z) - A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it does not require training a reward model or unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z) - REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards [101.7246658985579]
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data.
We propose embracing the heterogeneity of diverse rewards by following a multi-policy strategy.
We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks.
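The central operation, interpolating the weights of models fine-tuned on different rewards, can be sketched as below; the toy architecture and the interpolation coefficient are assumptions, and sweeping the coefficient then traces different trade-offs between the two rewards.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, lam: float):
    """Return lam * sd_a + (1 - lam) * sd_b, parameter by parameter."""
    return {k: lam * sd_a[k] + (1.0 - lam) * sd_b[k] for k in sd_a}

# Toy stand-ins for two copies of one architecture fine-tuned on different rewards.
model_a = torch.nn.Linear(8, 2)      # imagine: fine-tuned on reward 1
model_b = torch.nn.Linear(8, 2)      # imagine: fine-tuned on reward 2

soup = torch.nn.Linear(8, 2)
soup.load_state_dict(interpolate_state_dicts(model_a.state_dict(),
                                              model_b.state_dict(), lam=0.5))
```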
arXiv Detail & Related papers (2023-06-07T14:58:15Z) - Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - Choosing the Best of Both Worlds: Diverse and Novel Recommendations through Multi-Objective Reinforcement Learning [68.45370492516531]
We introduce Scalarized Multi-Objective Reinforcement Learning (SMORL) for the Recommender Systems (RS) setting.
The SMORL agent augments standard recommendation models with additional RL layers that push them to simultaneously satisfy three principal objectives: accuracy, diversity, and novelty of recommendations.
Our experimental results on two real-world datasets reveal a substantial increase in aggregate diversity, a moderate increase in accuracy, reduced repetitiveness of recommendations, and demonstrate the importance of reinforcing diversity and novelty as complementary objectives.
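The scalarization itself is just a weighted combination of the per-objective signals; the sketch below uses invented reward values and weights purely for illustration, not the SMORL implementation.

```python
def scalarized_reward(rewards, weights):
    """Collapse per-objective rewards (accuracy, diversity, novelty) into one scalar."""
    return sum(w * r for w, r in zip(weights, rewards))

# Invented per-recommendation signals in [0, 1] and objective weights.
reward = scalarized_reward(rewards=(0.8, 0.4, 0.6),     # accuracy, diversity, novelty
                           weights=(0.6, 0.2, 0.2))     # relative importance
print(reward)                                           # 0.68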
arXiv Detail & Related papers (2021-10-28T13:22:45Z) - Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
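The alternation can be summarized by the schematic loop below; both update steps are placeholders, and the actual self-supervised reward-inference objective is not reproduced here.

```python
def update_policy(policy, shaped_reward_fn, buffer):
    """Placeholder: run some RL updates with the current shaped reward."""
    return policy

def infer_shaped_reward(policy, buffer):
    """Placeholder: fit a denser reward from the agent's own experience."""
    return lambda state, action: 0.0

policy, buffer = object(), []
shaped_reward = lambda state, action: 0.0        # start from the sparse, uninformative reward
for _ in range(10):                              # alternate the two phases
    policy = update_policy(policy, shaped_reward, buffer)
    shaped_reward = infer_shaped_reward(policy, buffer)
```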
arXiv Detail & Related papers (2021-03-08T03:28:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.