Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
- URL: http://arxiv.org/abs/2403.13578v1
- Date: Wed, 20 Mar 2024 13:24:41 GMT
- Title: Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
- Authors: Do June Min, Veronica Perez-Rosas, Kenneth Resnicow, Rada Mihalcea
- Abstract summary: We study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation.
We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously.
- Score: 21.983823344984483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation. We focus on the task of counselor reflection generation, where we optimize the generators to simultaneously improve the fluency, coherence, and reflection quality of generated counselor responses. We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously. Specifically, we employ non-contextual and contextual multi-arm bandits to dynamically adjust multiple reward weights during training. Through automatic and manual evaluations, we show that our proposed techniques, DynaOpt and C-DynaOpt, outperform existing naive and bandit baselines, showcasing their potential for enhancing language models.
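The key mechanism described in the abstract is letting a multi-armed bandit adjust how the individual rewards (fluency, coherence, reflection quality) are mixed into a single scalar as training progresses. The following is a minimal, illustrative sketch of that general idea using an Exp3-style bandit; the class and function names are hypothetical and are not taken from the paper's released code.

```python
import numpy as np

# Illustrative sketch only: an Exp3-style bandit whose arm probabilities are
# reused as mixing weights over the individual rewards. Names are hypothetical
# and not taken from the DynaOpt / C-DynaOpt implementation.
class RewardWeightBandit:
    def __init__(self, n_rewards: int, gamma: float = 0.1):
        self.gamma = gamma
        self.log_weights = np.zeros(n_rewards)

    def weights(self) -> np.ndarray:
        w = np.exp(self.log_weights - self.log_weights.max())
        p = w / w.sum()
        return (1.0 - self.gamma) * p + self.gamma / len(p)  # Exp3 exploration mix

    def update(self, arm: int, gain: float) -> None:
        # Importance-weighted exponential update for the arm whose reward was observed.
        p = self.weights()
        self.log_weights[arm] += self.gamma * gain / (p[arm] * len(p))


def scalarize(bandit: RewardWeightBandit, reward_values: np.ndarray) -> float:
    """Combine per-quality rewards (e.g., fluency, coherence, reflection) into one value."""
    return float(np.dot(bandit.weights(), reward_values))
```

In a training loop, each sampled response would be scored by all reward models, scalarized with the current weights for the policy-gradient update, and the observed change in a reward fed back to the bandit through update().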
Related papers
- Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
arXiv Detail & Related papers (2024-06-16T15:24:50Z)
- Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences.
Existing methods work by emulating the preferences at the single decision (turn) level.
We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
arXiv Detail & Related papers (2024-05-23T14:53:54Z)
- MORL-Prompt: An Empirical Analysis of Multi-Objective Reinforcement Learning for Discrete Prompt Optimization [45.410121761165634]
RL-based techniques can be employed to search for prompts that, when fed into a target language model, maximize a set of user-specified reward functions.
Current techniques focus on maximizing the average of reward functions, which does not necessarily lead to prompts that achieve balance across rewards.
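A small, purely illustrative example of why averaging can differ from balancing (the numbers are made up, not from the paper):

```python
# Illustrative only: average vs. worst-case aggregation of per-reward scores.
rewards = {"fluency": 0.9, "relevance": 0.2, "style": 0.8}  # hypothetical values

avg_score = sum(rewards.values()) / len(rewards)  # ~0.63: looks acceptable
min_score = min(rewards.values())                 # 0.20: exposes the imbalance
```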
arXiv Detail & Related papers (2024-02-18T21:25:09Z)
- Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation [40.74782694945025]
We propose Parrot, which addresses the issue of manually adjusting reward weights.
We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network.
We also introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion.
arXiv Detail & Related papers (2024-01-11T05:36:36Z)
- MultiPrompter: Cooperative Prompt Optimization with Multi-Agent Reinforcement Learning [68.40755873520808]
MultiPrompter is a new framework that views prompt optimization as a cooperative game between prompters.
We show that MultiPrompter effectively reduces the problem size and helps prompters learn optimal prompts.
arXiv Detail & Related papers (2023-10-25T15:58:51Z)
- Guide Your Agent with Adaptive Multimodal Rewards [107.08768813632032]
This work presents Adaptive Return-conditioned Policy (ARP), an efficient framework to enhance the agent's generalization ability.
Our key idea is to calculate a similarity between visual observations and natural language instructions in the pre-trained multimodal embedding space.
Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization.
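A minimal sketch of the similarity-as-reward idea, assuming the observation and the instruction have already been encoded into a shared multimodal embedding space by a pre-trained encoder (the encoder is not shown, and the function name is hypothetical):

```python
import numpy as np

def multimodal_reward(obs_emb: np.ndarray, instr_emb: np.ndarray) -> float:
    # Cosine similarity between the visual-observation embedding and the
    # instruction embedding, used as a per-timestep reward signal.
    obs = obs_emb / np.linalg.norm(obs_emb)
    instr = instr_emb / np.linalg.norm(instr_emb)
    return float(obs @ instr)
```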
arXiv Detail & Related papers (2023-09-19T17:39:20Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
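The summary does not spell out the formulation; as background only, a generic soft Bellman backup (textbook soft Q-learning, not necessarily the paper's exact objective) looks like this:

```python
import numpy as np

def soft_q_target(reward: float, next_q: np.ndarray,
                  gamma: float = 1.0, tau: float = 1.0) -> float:
    # Generic soft Bellman target: r + gamma * tau * logsumexp(Q(s', .) / tau).
    return reward + gamma * tau * np.log(np.sum(np.exp(next_q / tau)))
```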
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- DORB: Dynamically Optimizing Multiple Rewards with Bandits [101.68525259222164]
Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks.
We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) the Single Multi-reward Bandit (SM-Bandit) and (2) the Hierarchical Multi-reward Bandit (HM-Bandit).
We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
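Exp3 is a standard adversarial-bandit algorithm; the sketch below illustrates how an Exp3-style draw could select which reward metric to optimize in a given round (illustrative only, not DORB's actual code):

```python
import numpy as np

# Illustrative only: use an Exp3-style draw to pick which reward metric to
# optimize this round, rather than mixing all rewards at once.
def pick_reward_arm(log_weights: np.ndarray, gamma: float = 0.1, rng=np.random) -> int:
    w = np.exp(log_weights - log_weights.max())
    p = (1.0 - gamma) * w / w.sum() + gamma / len(w)
    return int(rng.choice(len(w), p=p))
```

The chosen arm's reward would then drive that round's policy update, with its weight adjusted afterwards by the usual Exp3 rule (as in the sketch after the main abstract above).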
arXiv Detail & Related papers (2020-11-15T21:57:47Z)
- Improving GAN Training with Probability Ratio Clipping and Sample Reweighting [145.5106274085799]
Generative adversarial networks (GANs) often suffer from inferior performance due to unstable training.
We propose a new variational GAN training framework which enjoys superior training stability.
By plugging the training approach in diverse state-of-the-art GAN architectures, we obtain significantly improved performance over a range of tasks.
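A minimal sketch of probability ratio clipping in the PPO style; the paper's variational objective and its sample reweighting scheme are not reproduced here, so this only illustrates the clipping ingredient:

```python
import numpy as np

def clipped_ratio_loss(logp_new: np.ndarray, logp_old: np.ndarray,
                       advantage: np.ndarray, eps: float = 0.2) -> float:
    # Clip the new/old probability ratio so a single update cannot move the
    # generator too far, which is the stabilization idea referenced above.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(-np.mean(np.minimum(ratio * advantage, clipped * advantage)))
```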
arXiv Detail & Related papers (2020-06-12T01:39:48Z)