ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation
- URL: http://arxiv.org/abs/2601.15330v1
- Date: Tue, 20 Jan 2026 04:39:28 GMT
- Title: ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation
- Authors: Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han
- Abstract summary: Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue. We propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.
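The abstract names two mechanisms, corpus augmentation with underspecified prompts and a reward conditioned on the user's illocutionary intent, but gives no formulas. The Python sketch below is a hedged illustration of how such a calibrated reward could look; `make_underspecified`, `asks_clarification`, `icpo_reward`, the clarification markers, and all reward constants are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of the two mechanisms described in the ICPO abstract
# (arXiv:2601.15330): (1) augment training prompts into underspecified
# variants, and (2) condition the verifiable reward on whether the prompt
# was ambiguous. All names and constants are illustrative assumptions.
import random


def make_underspecified(prompt: str, constraints: list[str]) -> str:
    """Drop one concrete constraint to create an ambiguous prompt variant
    (an assumption; the paper's augmentation procedure may differ)."""
    removed = random.choice(constraints)
    return " ".join(prompt.replace(removed, "").split())


def asks_clarification(response: str) -> bool:
    """Crude proxy for clarification-seeking or hedging (an assumption)."""
    markers = ("could you clarify", "do you mean", "which one", "not sure", "?")
    return any(m in response.lower() for m in markers)


def icpo_reward(response: str, is_ambiguous: bool, verifiably_correct: bool) -> float:
    """Illocution-calibrated reward.

    Unambiguous prompt -> the standard verifiable reward for a direct answer.
    Ambiguous prompt   -> reward expressing uncertainty or asking for
    clarification, and penalize confident direct answers, the behavior the
    abstract says vanilla RLVR over-rewards.
    """
    if not is_ambiguous:
        return 1.0 if verifiably_correct else 0.0
    if asks_clarification(response):
        return 1.0                               # appropriate humility
    return 0.5 if verifiably_correct else -0.5   # lucky guess vs. confident error


if __name__ == "__main__":
    ambiguous = make_underspecified("Sort this list in descending order",
                                    ["in descending order"])
    print(ambiguous)                                                   # "Sort this list"
    print(icpo_reward("Ascending or descending order?", True, False))  # 1.0
    print(icpo_reward("def f(x): return sorted(x)", True, False))      # -0.5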
Related papers
- Provable and Practical In-Context Policy Optimization for Self-Improvement [49.670847804409874]
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards, without modifying its parameters. We propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively refines its response in context at inference time using self-assessed rewards.
arXiv Detail & Related papers (2026-03-02T00:21:50Z) - Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning [66.52010873968383]
We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training. Experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods.
arXiv Detail & Related papers (2026-01-19T14:55:54Z) - SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning [46.70182219204539]
We introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting.
arXiv Detail & Related papers (2025-12-15T10:08:53Z) - ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning [17.98065634130798]
We propose the Intrinsic Confidence-Driven Group Relative Preference Optimization (ICPO) method. ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt. We find that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors.
arXiv Detail & Related papers (2025-11-26T03:10:15Z) - DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model [65.93900011975238]
DELULU is a speaker-aware self-supervised foundational model for verification, diarization, and profiling applications. It is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
arXiv Detail & Related papers (2025-10-20T15:35:55Z) - STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. We develop anchored reinforcement training, a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. Experiments on the MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z) - A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions [8.717610965852037]
We propose a novel training paradigm that generates diverse responses at a given proficiency level. Responses are converted into synthesized speech via speaker-aware text-to-speech synthesis, and a multimodal large language model integrates the aligned textual features with the speech signals to predict proficiency scores directly.
arXiv Detail & Related papers (2025-06-04T15:42:53Z) - Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
arXiv Detail & Related papers (2024-06-16T15:24:50Z) - Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training [33.57497419019826]
Action-Based Contrastive Self-Training (ACT) enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when no action labels are available. We also propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation.
arXiv Detail & Related papers (2024-05-31T22:44:48Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of such a paradigm under attacks from both zero-knowledge and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z) - An Adversarially-Learned Turing Test for Dialog Generation Models [45.991035017908594]
We propose an adversarial training approach to learn a robust model, ATT, that discriminates machine-generated responses from human-written replies.
In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples.
Our discriminator shows high accuracy on strong attackers including DialoGPT and GPT-3.
arXiv Detail & Related papers (2021-04-16T17:13:14Z) - Refine and Imitate: Reducing Repetition and Inconsistency in Persuasion Dialogues via Reinforcement Learning and Human Demonstration [45.14559188965439]
We propose to apply reinforcement learning to refine an MLE-based language model without user simulators.
We distill sentence-level information about repetition, inconsistency and task relevance through rewards.
Experiments show that our model outperforms previous state-of-the-art dialogue models on both automatic metrics and human evaluation results.
arXiv Detail & Related papers (2020-12-31T00:02:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.