Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback
- URL: http://arxiv.org/abs/2403.11330v2
- Date: Tue, 23 Apr 2024 03:17:07 GMT
- Title: Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback
- Authors: Dong Won Lee, Hae Won Park, Yoon Kim, Cynthia Breazeal, Louis-Philippe Morency
- Abstract summary: We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals.
We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
- Score: 71.55265615594669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals. At a high level, our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the standard RLHF pipeline to improve an LLM-based dialogue agent. We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
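As a rough illustration of the reward decomposition described above, the sketch below trains a turn-level reward model whose per-turn scores are constrained to sum to the session-level GE annotation, with an auxiliary term that pulls them toward the LI multimodal signal. The sum constraint, the shaping term, and all names are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TurnRewardModel(nn.Module):
    """Scores a single dialogue turn (hypothetical architecture)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (num_turns, hidden_dim) -> (num_turns,) turn rewards
        return self.scorer(turn_embeddings).squeeze(-1)

def geli_style_loss(model, turn_embeddings, global_reward, implicit_signals, alpha=0.1):
    """Decompose one Global Explicit (GE) session reward into turn-level rewards.

    global_reward:    scalar session-level human annotation
    implicit_signals: (num_turns,) Local Implicit (LI) multimodal proxy, e.g.
                      listener affect per turn (assumed preprocessed to scalars)
    """
    turn_rewards = model(turn_embeddings)
    # Redistribution constraint: per-turn rewards should add up to the GE reward.
    decomposition_loss = (turn_rewards.sum() - global_reward) ** 2
    # Crossmodal shaping: keep each turn's reward close to the LI signal.
    shaping_loss = ((turn_rewards - implicit_signals) ** 2).mean()
    return decomposition_loss + alpha * shaping_loss
```

The learned turn-level reward model would then stand in for the usual preference reward inside a standard RLHF (e.g., PPO) loop.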
Related papers
- EPO: Hierarchical LLM Agents with Environment Preference Optimization [25.682236898002]
We propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation.
To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals.
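For intuition, a minimal sketch of the hierarchical decomposition described above might look as follows; `subgoal_llm`, `action_llm`, and `reward_model` are placeholder interfaces, and the prompts are illustrative rather than taken from the paper.

```python
def hierarchical_step(subgoal_llm, action_llm, reward_model, task, observation):
    """One step of a hierarchical agent: predict a subgoal, then a low-level action."""
    subgoal = subgoal_llm.generate(
        f"Task: {task}\nObservation: {observation}\nNext subgoal:"
    )
    action = action_llm.generate(
        f"Subgoal: {subgoal}\nObservation: {observation}\nNext action:"
    )
    # A learned reward model scores the pair from (multimodal) environment
    # feedback, standing in for missing human annotations.
    reward = reward_model.score(observation, subgoal, action)
    return subgoal, action, reward
```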
arXiv Detail & Related papers (2024-08-28T18:44:02Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences.
Existing methods work by emulating preferences at the single-decision (turn) level.
We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
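To make the conversation-level preference setting concrete, a generic Bradley-Terry-style loss over two full dialogues (rather than single turns) could look like this; it illustrates the setting and is not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def conversation_preference_loss(turn_scorer, preferred_turns, rejected_turns):
    """Preference loss between two whole multi-turn conversations.

    turn_scorer maps one turn representation to a scalar tensor; summing over
    turns yields a conversation-level score, so the preference is expressed
    over full dialogues.
    """
    score_preferred = torch.stack([turn_scorer(t) for t in preferred_turns]).sum()
    score_rejected = torch.stack([turn_scorer(t) for t in rejected_turns]).sum()
    # Maximize the log-likelihood that the preferred conversation wins.
    return -F.logsigmoid(score_preferred - score_rejected)
```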
arXiv Detail & Related papers (2024-05-23T14:53:54Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
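A hedged sketch of an entropy-augmented, token-level policy loss in this spirit is shown below; the per-token advantages and the entropy coefficient are assumed inputs, and this is not the exact ETPO update.

```python
import torch

def entropy_regularized_token_loss(logits, actions, advantages, beta=0.01):
    """Token-level policy gradient with an entropy bonus.

    logits:     (seq_len, vocab_size) policy logits for the generated tokens
    actions:    (seq_len,) sampled token ids
    advantages: (seq_len,) per-token advantage estimates
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(token_log_probs * advantages).mean()
    # Entropy bonus keeps the token-level policy from collapsing too early.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return policy_loss - beta * entropy
```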
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
A Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similarly improved performance on code generation tasks.
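The step-level guidance can be pictured as a greedy search in which a PRM scores each candidate continuation; `propose_steps` and `score_step` below are hypothetical interfaces, not the paper's API.

```python
def greedy_search_with_prm(llm, prm, question, max_steps=8, num_candidates=4):
    """Greedy step-by-step reasoning guided by a process-supervised reward model."""
    steps = []
    for _ in range(max_steps):
        # The LLM proposes several candidate next reasoning steps.
        candidates = llm.propose_steps(question, steps, n=num_candidates)
        # Keep the candidate whose partial solution the PRM rates highest.
        best = max(candidates, key=lambda s: prm.score_step(question, steps + [s]))
        steps.append(best)
        if best.strip().startswith("Answer:"):  # illustrative stopping criterion
            break
    return steps
```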
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO), which combines an LLM with a smaller task-oriented dialogue model.
The approach uses the LLM as an annotation-free user simulator to assess dialogue responses, and combines this feedback with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
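A rough sketch of the simulator-as-scorer idea follows; `tod_model.generate` and `user_simulator_llm.rate` are hypothetical interfaces rather than the paper's API.

```python
def simulator_scored_response(tod_model, user_simulator_llm, dialogue_history, n=5):
    """Score candidate TOD responses with an LLM acting as a simulated user."""
    # The smaller task-oriented dialogue model proposes candidate responses.
    candidates = tod_model.generate(dialogue_history, num_return_sequences=n)
    # The LLM-based user simulator returns an annotation-free satisfaction score
    # for each candidate, which can serve as an optimization signal.
    scores = [user_simulator_llm.rate(dialogue_history, c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores
```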
arXiv Detail & Related papers (2023-06-16T13:04:56Z) - Taming Continuous Posteriors for Latent Variational Dialogue Policies [1.0312968200748118]
We revisit Gaussian variational posteriors for latent-action RL and show that they can yield even better performance than categoricals.
We achieve this by simplifying the training procedure and proposing ways to regularize the latent dialogue policy.
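For intuition, a minimal Gaussian latent-action policy might look like the sketch below; the layer sizes and the reparameterized sampling are standard choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class GaussianLatentPolicy(nn.Module):
    """Dialogue policy with a continuous (Gaussian) latent action."""
    def __init__(self, context_dim: int = 512, latent_dim: int = 32):
        super().__init__()
        self.mu = nn.Linear(context_dim, latent_dim)
        self.logvar = nn.Linear(context_dim, latent_dim)

    def sample_latent(self, context: torch.Tensor) -> torch.Tensor:
        # Encode the dialogue context into a Gaussian posterior and sample a
        # latent action with the reparameterization trick.
        mu, logvar = self.mu(context), self.logvar(context)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps
```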
arXiv Detail & Related papers (2022-05-16T12:50:32Z) - Imperfect also Deserves Reward: Multi-Level and Sequential Reward Modeling for Better Dialog Management [17.168214640974337]
For task-oriented dialog systems, training a Reinforcement Learning based Dialog Management module suffers from low sample efficiency and slow convergence speed due to the sparse rewards in RL.
We propose a multi-level reward modeling approach that factorizes a reward into a three-level hierarchy: domain, act, and slot.
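A toy version of such a domain/act/slot factorization could look like the following; the weights and matching rules are illustrative assumptions, not the paper's definitions.

```python
def multi_level_reward(prediction, reference, weights=(0.2, 0.3, 0.5)):
    """Combine domain-, act-, and slot-level rewards for one dialogue turn.

    prediction / reference are dicts with 'domain', 'act', and 'slots' keys.
    """
    w_domain, w_act, w_slot = weights
    domain_reward = float(prediction["domain"] == reference["domain"])
    act_reward = float(prediction["act"] == reference["act"])
    # Partial credit at the slot level: fraction of reference slots filled correctly.
    ref_slots = reference["slots"]
    slot_reward = (
        sum(prediction["slots"].get(k) == v for k, v in ref_slots.items()) / len(ref_slots)
        if ref_slots else 1.0
    )
    return w_domain * domain_reward + w_act * act_reward + w_slot * slot_reward
```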
arXiv Detail & Related papers (2021-04-10T12:20:23Z) - Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation [33.688270031454095]
Reward learning has been introduced to learn from state-action pairs of an optimal policy to provide turn-by-turn rewards, but this approach requires complete state-action annotations of human-to-human dialogues.
We propose a novel reward learning approach for semi-supervised policy learning.
arXiv Detail & Related papers (2020-05-09T06:28:44Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
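A minimal multi-task sketch of such a joint setup is shown below, assuming a shared encoder feeding a token-level correction head and an intent head; the dimensions and the 0.5 loss weighting are placeholder assumptions.

```python
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    """Joint ASR-correction and language-understanding heads over one encoder."""
    def __init__(self, encoder, hidden_dim=768, vocab_size=32000, num_intents=20):
        super().__init__()
        self.encoder = encoder                 # any module: token ids -> (seq_len, hidden_dim)
        self.correction_head = nn.Linear(hidden_dim, vocab_size)
        self.intent_head = nn.Linear(hidden_dim, num_intents)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, asr_tokens, corrected_tokens, intent_label, lu_weight=0.5):
        hidden = self.encoder(asr_tokens)      # (seq_len, hidden_dim)
        # Token-level correction of the noisy ASR hypothesis.
        correction_loss = self.ce(self.correction_head(hidden), corrected_tokens)
        # Utterance-level intent classification from the pooled representation.
        intent_logits = self.intent_head(hidden.mean(dim=0, keepdim=True))
        intent_loss = self.ce(intent_logits, intent_label)
        return correction_loss + lu_weight * intent_loss
```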