Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition
- URL: http://arxiv.org/abs/2505.15922v1
- Date: Wed, 21 May 2025 18:19:45 GMT
- Title: Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition
- Authors: Dong Won Lee, Hae Won Park, Cynthia Breazeal, Louis-Philippe Morency
- Abstract summary: We propose a large language model-based reward decomposition framework for aligning dialogue agents. We leverage the reasoning capabilities of a frozen, pretrained large language model to infer fine-grained local implicit rewards. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods.
- Score: 57.732148933412425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a large language model-based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first, text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second, multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we use for RL-based fine-tuning of dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.
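As a rough illustration of the text-only variant, the sketch below prompts a frozen LLM with the dialogue transcript and the single session-level score and parses per-turn rewards. The prompt wording, the `query_llm` helper, and the rescaling step are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the text-only reward decomposition step, under
# illustrative assumptions (prompt wording, `query_llm`, and rescaling are
# not the authors' code).
import json
from typing import List

DECOMPOSITION_PROMPT = """\
A dialogue received a single session-level feedback score of {score}.
Attribute this score to the agent's turns: return a JSON list with one
numeric reward per agent turn, reflecting how much each turn contributed.

Dialogue transcript:
{transcript}
"""


def query_llm(prompt: str) -> str:
    """Stand-in for a call to a frozen, pretrained LLM (hypothetical helper)."""
    raise NotImplementedError


def decompose_session_feedback(agent_turns: List[str], session_score: float) -> List[float]:
    """Infer per-turn (local) rewards from one global, session-level score."""
    transcript = "\n".join(f"[Agent turn {i}] {turn}" for i, turn in enumerate(agent_turns))
    raw = query_llm(DECOMPOSITION_PROMPT.format(score=session_score, transcript=transcript))
    local_rewards = [float(r) for r in json.loads(raw)]
    # Rescale so the inferred local rewards still sum to the global score.
    total = sum(local_rewards) or 1.0
    return [r * session_score / total for r in local_rewards]
```

The resulting (turn, local reward) pairs would then serve as regression targets for distilling the lightweight turn-level reward model used during RL fine-tuning; the multimodal variant would additionally prepend natural-language descriptions of pitch, gaze, and facial affect to each turn before prompting.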
Related papers
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
- Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment [0.618727087412292]
The alignment of large language models (LLMs) is crucial for generating helpful and harmless content.
Existing approaches leverage preference-based human feedback data to learn the reward function.
We propose a novel training objective, Approximated Variational Alignment (AVA), to perform LLM alignment through Approximated Variational Reward Learning (AVRIL).
arXiv Detail & Related papers (2024-11-14T10:37:34Z)
- LLMs are Superior Feedback Providers: Bootstrapping Reasoning for Lie Detection with Self-Generated Feedback [33.14770105185958]
Large Language Models (LLMs) excel at generating human-like dialogues and comprehending text.
We propose a bootstrapping framework that leverages self-generated feedback to enhance LLM reasoning capabilities for lie detection.
We investigate the application of the proposed framework for detecting betrayal and deception in Diplomacy games, and compare it with feedback from professional human players.
arXiv Detail & Related papers (2024-08-25T18:47:55Z)
- FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback [16.24562885483636]
We propose an innovative method to align modalities in Large Vision-Language Models (LVLMs) through Fine-Grained Artificial Intelligence Feedback (FGAIF). Specifically, we first utilize AI tools to predict the types of hallucination for each segment in the response and obtain a collection of fine-grained feedback. Then, based on the collected reward data, three specialized reward models are trained to produce dense rewards. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm.
arXiv Detail & Related papers (2024-04-07T19:00:45Z)
- Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback [71.55265615594669]
We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals.
We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
arXiv Detail & Related papers (2024-03-17T20:21:26Z)
- Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation [29.6763730290473]
Reinforcement learning can align language models with non-differentiable reward signals, such as human preferences.
This paper introduces a novel framework that utilizes the critique capability of Large Language Models to produce intermediate-step rewards.
arXiv Detail & Related papers (2024-01-14T22:05:11Z)
- JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning [53.83063435640911]
Dialogue policy learning (DPL) is a crucial component of dialogue modelling.
We introduce a novel framework, JoTR, to generate flexible dialogue actions.
Unlike traditional methods, JoTR formulates a word-level policy that allows for more dynamic and adaptable dialogue action generation.
arXiv Detail & Related papers (2023-09-01T03:19:53Z)
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO), which combines an LLM with a smaller task-oriented dialogue model.
The LLM serves as an annotation-free user simulator that assesses dialogue responses, and its feedback is used to optimize smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
- SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation [54.66399120084227]
Language models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue.
For the persona-based dialogue generation task, maintaining consistency and coherence remains a major challenge for language models.
A two-stage SimOAP strategy is proposed, i.e., over-sampling and post-evaluation; a minimal sketch of this idea appears after the list below.
arXiv Detail & Related papers (2023-05-18T17:23:00Z)
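To make the over-sampling and post-evaluation strategy summarized in the SimOAP entry concrete, here is a minimal Python sketch; `generate_response`, `coherence_score`, and `consistency_score` are hypothetical helpers standing in for a persona-based dialogue model and two evaluators, and the temperature range is an arbitrary illustration rather than the paper's setting.

```python
# Rough sketch of a two-stage over-sampling / post-evaluation decoding loop in
# the spirit of SimOAP; helpers below are hypothetical placeholders.
import random
from typing import Callable, List


def generate_response(context: str, temperature: float) -> str:
    """Stand-in for sampling one reply from a persona-based dialogue model."""
    raise NotImplementedError


def simoap_style_decode(
    context: str,
    n_samples: int,
    coherence_score: Callable[[str, str], float],
    consistency_score: Callable[[str, str], float],
) -> str:
    # Stage 1: over-sampling -- draw many diverse candidate responses.
    candidates: List[str] = [
        generate_response(context, temperature=random.uniform(0.7, 1.2))
        for _ in range(n_samples)
    ]
    # Stage 2: post-evaluation -- keep the candidate that scores best on
    # coherence with the context plus persona consistency.
    return max(
        candidates,
        key=lambda reply: coherence_score(context, reply) + consistency_score(context, reply),
    )
```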