TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2407.16574v2
- Date: Sun, 08 Dec 2024 14:22:13 GMT
- Title: TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
- Authors: Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Wontae Nam, Daejin Jo, Kyoung-Woon On, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
- Abstract summary: We introduce TLCR (Token-Level Continuous Reward) for Reinforcement Learning from Human Feedback.
Our proposed TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards.
- Score: 24.4488286574098
- License:
- Abstract: Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human intentions. These human preference data, however, are labeled at the sequence level, creating a mismatch between the sequence-level preference labels and the tokens that the language model generates autoregressively. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0) and fail to account for the varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive from negative tokens and uses the discriminator's confidence to assign a context-dependent continuous reward to each token. Extensive experiments show that the proposed TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards on open-ended generation benchmarks.
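For illustration, the sketch below shows one way to turn a token-level discriminator's confidence into a continuous per-token reward, as described in the abstract. This is a minimal sketch rather than the authors' implementation: the names (`discriminator`, `prompt_ids`, `response_ids`), the two-class (negative/positive) discriminator head, and the [-1, 1] reward scale are all assumptions.

```python
# Minimal sketch of TLCR-style token-level continuous rewards (hypothetical API).
# Assumption: `discriminator` maps token ids of shape (batch, seq_len) to per-token
# logits of shape (batch, seq_len, 2) over (negative, positive) classes.
import torch


def token_level_continuous_rewards(discriminator, prompt_ids, response_ids):
    """Use the discriminator's confidence as a continuous reward in [-1, 1] per response token."""
    input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)   # (1, prompt+response)
    with torch.no_grad():
        logits = discriminator(input_ids)                            # (1, seq_len, 2)
    probs = logits.softmax(dim=-1)[0, -response_ids.numel():]        # response tokens only
    # Confidence of "positive" minus confidence of "negative":
    # near +1 = confidently preferred, near -1 = confidently dispreferred, near 0 = neutral.
    return probs[:, 1] - probs[:, 0]                                 # (num_response_tokens,)


# Shape check with a stand-in discriminator that returns random logits.
if __name__ == "__main__":
    dummy = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 2)
    rewards = token_level_continuous_rewards(dummy, torch.tensor([1, 2, 3]), torch.tensor([4, 5]))
    print(rewards)  # two values in [-1, 1], one per response token
```

In an actual RLHF loop, per-token values of this kind would stand in for the single sequence-level reward during policy optimization.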
Related papers
- Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model [96.20350225621813]
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences.
In this paper, we seek to get the best of both sequence-level and token-level rewards by training and utilizing a segment-level reward model.
arXiv Detail & Related papers (2025-01-06T06:17:56Z)
- T-REG: Preference Optimization with Token-Level Reward Regularization [35.07328450591201]
Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models with human values.
Recent methods have attempted to address the coarseness of sequence-level rewards by introducing token-level rewards.
We propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization.
arXiv Detail & Related papers (2024-12-03T18:56:07Z)
- Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment [33.5805074836187]
Reinforcement Learning from Human Feedback (RLHF) has proven highly effective in aligning Large Language Models (LLMs) with human preferences.
A key limitation, however, stems from RLHF's lack of awareness regarding which specific tokens should be reinforced or suppressed.
We propose the "Adaptive Message-wise RLHF" method, which robustly applies to various tasks.
arXiv Detail & Related papers (2024-10-23T16:16:15Z)
- SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks [13.600674179059238]
We propose a flexible objective termed SparsePO to automatically learn to weight the KL divergence and reward corresponding to each token during Preference Optimization training.
We show that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.
arXiv Detail & Related papers (2024-10-07T15:01:29Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Perception and Semantic Aware Regularization for Sequential Confidence Calibration [12.265757315192497]
We propose a Perception and Semantic aware Sequence Regularization framework.
We introduce a semantic context-free recognition and a language model to acquire similar sequences with high perceptual similarity and semantic correlation.
Experiments on canonical sequence recognition tasks, including scene text and speech recognition, demonstrate that our method sets new state-of-the-art results.
arXiv Detail & Related papers (2023-05-31T02:16:29Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its powerful language understanding capability, outperforms various strong pre-trained language models.
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration [56.64703901898937]
We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training.
Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields less repetitive texts.
arXiv Detail & Related papers (2022-05-05T08:50:50Z)
- Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies.
A vanilla NMT model usually adopts a trivial equal-weighted objective for target tokens with different frequencies.
Low-frequency tokens may carry critical semantic information that will affect translation quality if they are neglected; a minimal weighting sketch follows this list.
arXiv Detail & Related papers (2020-10-09T05:55:05Z)
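As flagged in the last entry above, a frequency-aware token weighting can replace the equal-weighted NMT objective. The sketch below is a hedged illustration under assumed names (`token_freqs`, `alpha`) and an assumed weighting function; the paper's exact adaptive weighting may differ.

```python
# Hedged sketch of a frequency-aware token-level objective for NMT.
# Assumption: `token_freqs` is a float tensor of corpus frequencies indexed by token id,
# and any monotonically decreasing function of frequency is a reasonable stand-in weight.
import torch
import torch.nn.functional as F


def adaptive_token_weights(target_ids, token_freqs, alpha=1.0):
    """Up-weight low-frequency target tokens relative to high-frequency ones."""
    freqs = token_freqs[target_ids].clamp(min=1.0)
    return 1.0 + alpha / freqs.log1p()


def frequency_weighted_nll(logits, targets, token_freqs):
    """Token-level weighted negative log-likelihood instead of the equal-weighted objective."""
    vocab = logits.size(-1)
    nll = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1), reduction="none")
    weights = adaptive_token_weights(targets.reshape(-1), token_freqs)
    return (weights * nll).mean()


# Example shapes: batch of 2 sentences, 5 target positions, vocabulary of 100.
if __name__ == "__main__":
    logits = torch.randn(2, 5, 100)
    targets = torch.randint(0, 100, (2, 5))
    token_freqs = torch.rand(100) * 1e4
    print(frequency_weighted_nll(logits, targets, token_freqs))
```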