MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
- URL: http://arxiv.org/abs/2410.02743v1
- Date: Thu, 3 Oct 2024 17:55:13 GMT
- Title: MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
- Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
- Abstract summary: Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences.
We propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process.
We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis.
- Score: 46.608747360764035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF .
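The credit-assignment argument in the abstract can be illustrated with a toy sketch. It assumes macro actions are fixed-length chunks of `n` tokens, which is a hypothetical grouping rule chosen for illustration (the abstract does not say how macro actions are delimited). The helper names `group_into_macro_actions` and `discounted_returns` and all numbers are illustrative and are not taken from the MA-RLHF code.

```python
from typing import List


def group_into_macro_actions(values: List[float], n: int) -> List[float]:
    """Sum per-token values over consecutive chunks of n tokens.

    Assumption: a macro action is a fixed-length chunk of n tokens.
    Summing token log-probs gives the log-prob of the whole chunk.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return [sum(values[i:i + n]) for i in range(0, len(values), n)]


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Discounted return-to-go for every decision step."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Toy episode: 8 tokens, with the reward arriving only at the final token.
token_logprobs = [-0.5, -1.0, -0.25, -0.25, -2.0, -0.5, -1.0, -0.5]
token_rewards = [0.0] * 7 + [1.0]

# Group tokens into macro actions of 4 tokens each (hypothetical choice).
macro_logprobs = group_into_macro_actions(token_logprobs, n=4)
macro_rewards = group_into_macro_actions(token_rewards, n=4)

token_returns = discounted_returns(token_rewards, gamma=0.5)
macro_returns = discounted_returns(macro_rewards, gamma=0.5)

print(len(token_returns), len(macro_returns))  # 8 2
print(token_returns[0], macro_returns[0])      # 0.0078125 0.5
```

In this toy setting the first decision sits seven steps from the reward at the token level but only one step away at the macro level, so the signal reaching it is discounted far less. The discount factor of 0.5 is exaggerated to make the effect visible.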
Related papers
- Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF.
Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost.
arXiv Detail & Related papers (2024-10-23T19:59:50Z)
- Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment [33.5805074836187]
Reinforcement Learning from Human Feedback (RLHF) has proven highly effective in aligning Large Language Models (LLMs) with human preferences.
This limitation stems from RLHF's lack of awareness regarding which specific tokens should be reinforced or suppressed.
We propose the "Adaptive Message-wise RLHF" method, which robustly applies to various tasks.
arXiv Detail & Related papers (2024-10-23T16:16:15Z)
- Language Models Learn to Mislead Humans via RLHF [100.95201965748343]
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex.
We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers.
Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
arXiv Detail & Related papers (2024-09-19T14:50:34Z)
- Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns Large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z)
- RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
Online iterative RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z)
- Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language and Vision-Language Models with human preferences.
To alleviate some of the computational burden of fine-tuning, parameter-efficient methods such as LoRA were introduced.
We benchmark the PE-RLHF setup on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering.
arXiv Detail & Related papers (2024-03-15T21:43:46Z)
- ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
arXiv Detail & Related papers (2024-02-11T22:40:12Z)
- Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration [29.935758027209292]
Preference-based feedback is important for many applications in reinforcement learning.
In this work, we take advantage of the fact that one can often choose contexts to obtain human feedback.
We show that our method is able to reach better performance with fewer samples of human preferences than multiple baselines.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, in three diverse settings, that optimizing for response length is a significant factor behind RLHF's reported improvements.
We find improvements in reward to largely be driven by increasing response length, instead of other features.
Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)