Learning Steerable Clarification Policies with Collaborative Self-play
- URL: http://arxiv.org/abs/2512.04068v1
- Date: Wed, 03 Dec 2025 18:49:54 GMT
- Title: Learning Steerable Clarification Policies with Collaborative Self-play
- Authors: Jonathan Berant, Maximillian Chen, Adam Fisch, Reza Aghajani, Fantine Huot, Mirella Lapata, Jacob Eisenstein
- Abstract summary: To handle ambiguous queries, AI assistants need a policy for managing their uncertainty. We propose to train steerable policies for managing this uncertainty using self-play. We show this leads to a steerable policy that changes its behavior predictably conditioned on the provided costs.
- Score: 67.67872810596839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To handle underspecified or ambiguous queries, AI assistants need a policy for managing their uncertainty to determine (a) when to guess the user intent and answer directly, (b) when to enumerate and answer multiple possible intents, and (c) when to ask a clarifying question. However, such policies are context-dependent, varying with factors such as user preferences or modality. For example, enumerating multiple possible user intentions is cumbersome on small screens or in a voice setting. In this work, we propose to train steerable policies for managing this uncertainty using self-play. Given two agents, one simulating a user and the other an AI assistant, we generate conversations where the user issues a potentially ambiguous query and the assistant must determine how to respond. Importantly, the model takes as input the numerical cost of each clarification question and of each generated word, and is asked to take the action that maximizes its final reward: the cost-penalized accuracy. We use Reinforced Self-Training (ReST) to train our model to achieve high reward and show that this leads to a steerable policy that changes its behavior predictably conditioned on the provided costs, yielding higher reward and accuracy. Moreover, our procedure generalizes to numerical cost values that were unobserved at training time.
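The abstract specifies the reward only as "cost-penalized accuracy" with a per-question and a per-word cost, which suggests a simple linear form. Below is a minimal Python sketch under that assumption; the function name and cost values are illustrative, not taken from the paper.

```python
def cost_penalized_reward(correct: bool,
                          num_clarifying_questions: int,
                          num_generated_words: int,
                          question_cost: float,
                          word_cost: float) -> float:
    """Assumed linear form: accuracy minus the cost of each clarifying
    question asked and each word generated."""
    accuracy = 1.0 if correct else 0.0
    return (accuracy
            - question_cost * num_clarifying_questions
            - word_cost * num_generated_words)

# A correct answer reached after one clarifying question and 40 generated
# words, with hypothetical costs of 0.2 per question and 0.002 per word.
print(cost_penalized_reward(True, 1, 40, question_cost=0.2, word_cost=0.002))
# -> 0.72
```

Because the costs are inputs to the model rather than fixed constants, the same policy can be steered at inference time simply by changing `question_cost` and `word_cost` in the context.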
Related papers
- e1: Learning Adaptive Control of Reasoning Effort [88.51897900019485]
Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. We propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens.
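The summary does not say how the user-specified token fraction enters training; one plausible reading is a reward that penalizes deviation from the target fraction of the budget. A hypothetical sketch (all names invented):

```python
def effort_controlled_reward(correct: bool,
                             tokens_used: int,
                             token_budget: int,
                             target_fraction: float,
                             penalty_weight: float = 1.0) -> float:
    """Hypothetical shaping: accuracy minus a penalty for deviating from
    the user-specified fraction of the token budget."""
    accuracy = 1.0 if correct else 0.0
    deviation = abs(tokens_used / token_budget - target_fraction)
    return accuracy - penalty_weight * deviation

# A correct answer using 512 of 1024 tokens when the user asked for ~25%.
print(effort_controlled_reward(True, 512, 1024, target_fraction=0.25))  # 0.75
```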
arXiv Detail & Related papers (2025-10-30T23:12:21Z)
- Steering Robots with Inference-Time Interactions [0.5801621787540268]
When a pretrained policy makes errors during deployment, there are limited mechanisms for users to correct its behavior. My research proposes an alternative: keeping pretrained policies frozen as a fixed skill repertoire while allowing user interactions to guide behavior generation at inference time. Specifically, I propose (1) inference-time steering, which leverages user interactions to switch between discrete skills, and (2) task and motion imitation, which enables user interactions to edit continuous motions while satisfying task constraints defined by discrete symbolic plans.
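As a rough illustration of the inference-time steering idea, frozen skills with user-driven switching, here is a hypothetical sketch; the class and skill names are invented for clarity.

```python
from typing import Callable, Dict

Skill = Callable[[list], list]  # a frozen policy mapping observations to actions

class InferenceTimeSteering:
    """Minimal sketch: the skill repertoire stays frozen; user interactions
    only select which discrete skill generates behavior."""

    def __init__(self, skills: Dict[str, Skill], initial: str):
        self.skills = skills
        self.active = initial

    def steer(self, skill_name: str) -> None:
        # A user interaction switches the active skill; no weights change.
        if skill_name not in self.skills:
            raise KeyError(f"unknown skill: {skill_name}")
        self.active = skill_name

    def act(self, observation: list) -> list:
        return self.skills[self.active](observation)

# Usage with two toy skills over a 1-D state.
ctrl = InferenceTimeSteering(
    {"reach": lambda obs: [o + 0.1 for o in obs],
     "retract": lambda obs: [o - 0.1 for o in obs]},
    initial="reach",
)
ctrl.steer("retract")
print(ctrl.act([0.5]))  # [0.4]
```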
arXiv Detail & Related papers (2025-06-17T07:59:07Z)
- Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [58.620269228776294]
We propose a task-agnostic framework for resolving ambiguity by asking users clarifying questions.
We evaluate systems across three NLP applications: question answering, machine translation and natural language inference.
We find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs.
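The summaries do not spell out how intent-sim decides when clarification is worthwhile; a common recipe consistent with the name is to simulate plausible user intents and clarify only when their distribution is uncertain. A hedged sketch of that decision rule:

```python
import math
from collections import Counter

def should_clarify(sampled_intents: list[str],
                   entropy_threshold: float = 0.5) -> bool:
    """Sketch: sample plausible intents for the query (e.g., from an LM)
    and ask a clarifying question only when their entropy is high."""
    counts = Counter(sampled_intents)
    total = len(sampled_intents)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy > entropy_threshold

# Two readings sampled for an ambiguous query -> clarify.
print(should_clarify(["bank=river", "bank=finance", "bank=finance"]))  # True
print(should_clarify(["bank=finance"] * 3))                            # False
```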
arXiv Detail & Related papers (2023-11-16T00:18:50Z)
- Social Contract AI: Aligning AI Assistants with Implicit Group Norms [37.68821926786935]
We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions.
We run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players.
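For readers unfamiliar with the ultimatum game used in these simulations, a minimal sketch of one round (payoff values are illustrative; the formalization of preferences as policies follows the summary above):

```python
def ultimatum_round(offer_fraction: float,
                    responder_min_fraction: float,
                    pot: float = 10.0) -> tuple[float, float]:
    """One round: the proposer offers a split of the pot; the responder
    accepts only if the offer meets their fairness threshold, which here
    plays the role of an implicit norm. Both get nothing on rejection."""
    offer = offer_fraction * pot
    if offer_fraction >= responder_min_fraction:
        return pot - offer, offer   # accepted: (proposer, responder) payoffs
    return 0.0, 0.0                 # rejected

# A simulated user whose implicit norm is "never accept less than 30%".
print(ultimatum_round(0.4, 0.3))  # (6.0, 4.0)
print(ultimatum_round(0.1, 0.3))  # (0.0, 0.0)
```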
arXiv Detail & Related papers (2023-10-26T20:27:03Z)
- Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist.
One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity.
We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
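The summary describes prompting the model iteratively to surface multiple plausible answers; a minimal sketch of such a loop, with `generate` standing in for an unspecified LM call that is asked to avoid answers found so far:

```python
from typing import Callable, Optional

def iterative_answering(question: str,
                        generate: Callable[[str, list[str]], Optional[str]],
                        max_rounds: int = 5) -> list[str]:
    """Sketch: repeatedly prompt for an answer not yet given, conditioning
    on the answers collected so far, until nothing new comes back."""
    answers: list[str] = []
    for _ in range(max_rounds):
        nxt = generate(question, answers)
        if nxt is None or nxt in answers:
            break
        answers.append(nxt)
    return answers

# Toy generator that knows two readings of an ambiguous question.
pool = ["Paris, Texas", "Paris, France"]
gen = lambda q, found: next((a for a in pool if a not in found), None)
print(iterative_answering("Where is Paris?", gen))  # both readings
```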
arXiv Detail & Related papers (2023-07-08T04:32:17Z)
- Residual Q-Learning: Offline and Online Policy Customization without Value [53.47311900133564]
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations.
We formulate a new problem setting called policy customization.
We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy.
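The paper's exact Bellman backup is not reproduced here; the sketch below only illustrates the decomposition the summary points at, a frozen prior policy combined with a residual term learned from the add-on reward alone (the TD-style update is a hypothetical stand-in):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99
residual_q: dict = defaultdict(float)   # residual values for the add-on reward

def customized_score(s, a, prior_log_prob) -> float:
    # The customized policy ranks actions by the frozen prior policy's
    # log-probability plus the learned residual term.
    return prior_log_prob(s, a) + residual_q[(s, a)]

def residual_td_update(s, a, r_add, s_next, actions, prior_log_prob) -> None:
    # Hypothetical backup on the add-on reward only; the reward the prior
    # policy was originally trained on never appears.
    best_next = max(customized_score(s_next, b, prior_log_prob) for b in actions)
    td_error = r_add + GAMMA * best_next - customized_score(s, a, prior_log_prob)
    residual_q[(s, a)] += ALPHA * td_error
```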
arXiv Detail & Related papers (2023-06-15T22:01:19Z)
- Personalized Algorithmic Recourse with Preference Elicitation [20.78332455864586]
We introduce PEAR, the first human-in-the-loop approach capable of providing personalized algorithmic recourse tailored to the needs of any end-user.
PEAR builds on insights from Bayesian Preference Elicitation to iteratively refine an estimate of the costs of actions by asking choice set queries to the target user.
Our empirical evaluation on real-world datasets highlights how PEAR produces high-quality personalized recourse in only a handful of iterations.
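The summary gives the elicitation loop but not its internals; below is a hypothetical version where a set of candidate cost models is pruned by pairwise choice-set queries (all names invented, and a real implementation would maintain a Bayesian posterior rather than hard pruning):

```python
import random
from typing import Callable

def elicit_cost_model(hypotheses: list[dict[str, float]],
                      plans: list[list[str]],
                      user_choice: Callable[[list[str], list[str]], int],
                      num_queries: int = 3) -> dict[str, float]:
    """Sketch: drop candidate cost models inconsistent with each observed
    user choice between two recourse plans."""
    def plan_cost(h: dict[str, float], plan: list[str]) -> float:
        return sum(h[action] for action in plan)

    for _ in range(num_queries):
        a, b = random.sample(plans, 2)           # present a choice-set query
        chosen, other = (a, b) if user_choice(a, b) == 0 else (b, a)
        consistent = [h for h in hypotheses
                      if plan_cost(h, chosen) <= plan_cost(h, other)]
        hypotheses = consistent or hypotheses    # never empty the candidate set
        if len(hypotheses) == 1:
            break
    return hypotheses[0]
```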
arXiv Detail & Related papers (2022-05-27T03:12:18Z)
- Sayer: Using Implicit Feedback to Optimize System Policies [63.992191765269396]
We develop a methodology that leverages implicit feedback to evaluate and train new system policies.
Sayer builds on two ideas from reinforcement learning to leverage data collected by an existing policy.
We show that Sayer can evaluate arbitrary policies accurately, and train new policies that outperform the production policies.
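The summary does not name the two reinforcement-learning ideas; randomized logging plus inverse propensity scoring is the standard recipe for evaluating a new policy from data collected by an existing one, so here is a sketch in that spirit (assuming the logging policy recorded its action probabilities):

```python
from typing import Callable

def ips_estimate(logs: list[dict],
                 new_policy_prob: Callable[[str, str], float]) -> float:
    """Inverse propensity scoring: reweight each logged reward by how much
    more (or less) likely the new policy is to take the logged action."""
    total = 0.0
    for entry in logs:
        weight = new_policy_prob(entry["context"], entry["action"]) / entry["prob"]
        total += weight * entry["reward"]
    return total / len(logs)

# Example: the logging policy chose "retry" with probability 0.5.
logs = [{"context": "slow-node", "action": "retry", "prob": 0.5, "reward": 1.0}]
print(ips_estimate(logs, lambda c, a: 1.0 if a == "retry" else 0.0))  # 2.0
```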
arXiv Detail & Related papers (2021-10-28T04:16:56Z)
- Interactive Question Clarification in Dialogue via Reinforcement Learning [36.746578601398866]
We propose a reinforcement model to clarify ambiguous questions by suggesting refinements of the original query.
The model is trained using reinforcement learning with a deep policy network.
We evaluate our model based on real-world user clicks and demonstrate significant improvements.
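As an illustration of click-driven policy training (the paper uses a deep policy network; this stand-in is a toy softmax policy over candidate refinements, updated with REINFORCE against a simulated user):

```python
import math
import random

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def simulated_user_clicks(refinement: int) -> bool:
    # Toy user model standing in for real click logs: users want refinement 0.
    return refinement == 0

def reinforce_step(logits: list[float], lr: float = 0.5) -> None:
    """Sample a refinement to suggest, observe a click (reward 1) or no
    click (reward 0), and follow the score-function gradient."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    reward = 1.0 if simulated_user_clicks(action) else 0.0
    for i, p in enumerate(probs):
        logits[i] += lr * reward * ((1.0 if i == action else 0.0) - p)

logits = [0.0, 0.0, 0.0]
for _ in range(200):
    reinforce_step(logits)
print(softmax(logits))  # mass shifts toward the refinement users click
```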
arXiv Detail & Related papers (2020-12-17T06:38:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.