Towards Cost-Effective Reward Guided Text Generation
- URL: http://arxiv.org/abs/2502.04517v1
- Date: Thu, 06 Feb 2025 21:36:44 GMT
- Title: Towards Cost-Effective Reward Guided Text Generation
- Authors: Ahmad Rashid, Ruotian Wu, Rongqi Fan, Hongliang Li, Agustinus Kristiadi, Pascal Poupart
- Abstract summary: Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback.
We present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a \emph{single} call to the reward model at each step of the generation process.
- Score: 27.11836864643437
- Abstract: Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training like in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a \emph{single call} to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
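As a rough illustration of the single-call idea described in the abstract, here is a minimal decoding sketch. It assumes HuggingFace-style causal-LM interfaces and a reward model whose output head yields a score for every candidate next token in one forward pass; the additive combination of logits and reward scores and the guidance weight `beta` are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Minimal sketch of single-call tokenwise reward-guided decoding.
# Assumptions (not from the paper): `base_lm` and `reward_model` are
# HuggingFace-style causal models returning `.logits` of shape
# (batch, seq_len, vocab); the reward model's last-position logits are
# read as per-candidate-token reward scores, so one call scores the
# whole vocabulary at each generation step.
import torch
import torch.nn.functional as F

@torch.no_grad()
def reward_guided_generate(base_lm, reward_model, input_ids,
                           max_new_tokens=64, beta=1.0, eos_token_id=None):
    ids = input_ids
    for _ in range(max_new_tokens):
        lm_logits = base_lm(ids).logits[:, -1, :]           # (batch, vocab)
        reward_scores = reward_model(ids).logits[:, -1, :]  # single call per step
        # Combine base-model preferences with reward guidance and sample.
        probs = F.softmax(lm_logits + beta * reward_scores, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_token], dim=-1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break
    return ids
```

Because the reward head spans the whole vocabulary, each generation step costs one reward-model forward pass, in contrast to RGTG variants that must score every candidate continuation separately.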
Related papers
- Chain-of-Retrieval Augmented Generation [72.06205327186069]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer.
Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model [96.20350225621813]
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference.
In this paper, we seek to combine the strengths of sequence-level and token-level rewards by training and utilizing a segment-level reward model.
arXiv Detail & Related papers (2025-01-06T06:17:56Z)
- Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize reward model training in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z)
- A Critical Look At Tokenwise Reward-Guided Text Generation [23.908449840589284]
We show that reward models trained on full sequences are not compatible with scoring partial sequences.
We propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the implied tokenwise policy at decoding time (a minimal sketch of this objective follows this entry).
arXiv Detail & Related papers (2024-06-12T00:19:40Z)
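For contrast with the single-call architecture above, here is a minimal sketch of a Bradley-Terry objective on partial sequences, as used by this line of work. It assumes, hypothetically, a reward model with a scalar head whose value is read at the last position; the function name and tensor layout are illustrative, not taken from the paper.

```python
# Minimal sketch of a Bradley-Terry loss over partial-sequence pairs.
# Assumption (not from the paper): `reward_model(ids).logits` has shape
# (batch, seq_len, 1), i.e. a scalar reward head, and the prefix score
# is read at the final position.
import torch.nn.functional as F

def bradley_terry_partial_loss(reward_model, chosen_prefix_ids, rejected_prefix_ids):
    r_chosen = reward_model(chosen_prefix_ids).logits[:, -1, 0]
    r_rejected = reward_model(rejected_prefix_ids).logits[:, -1, 0]
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch of prefix pairs.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

A scalar head scores one prefix per call, so guiding decoding with such a model requires a separate reward-model call for every candidate token; avoiding that per-candidate cost is the motivation for the single-call, vocabulary-wide head in the main paper.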
- ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation [43.506732624371786]
We introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL.
Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption.
arXiv Detail & Related papers (2023-08-04T09:35:45Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
- GROOT: Corrective Reward Optimization for Generative Sequential Labeling [10.306943706927004]
We propose GROOT -- a framework for Generative Reward Optimization Of Text sequences.
GROOT works by training a generative sequential labeling model to match the decoder output distribution with that of the (black-box) reward function.
As demonstrated via extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics.
arXiv Detail & Related papers (2022-09-29T11:35:47Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)