Related papers: InfoPO: Information-Driven Policy Optimization for User-Centric Agents

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

URL: http://arxiv.org/abs/2603.00656v1
Date: Sat, 28 Feb 2026 13:58:14 GMT
Title: InfoPO: Information-Driven Policy Optimization for User-Centric Agents
Authors: Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu,
Abstract summary: We introduce InfoPO, which frames multi-turn interaction as a process of active uncertainty reduction.<n>It computes an information-gain reward that credits turns whose feedback changes the agent's subsequent action distribution.<n>It then combines this signal with task outcomes via an adaptive variance-gated fusion.
Score: 39.407032905771885
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.

Related papers

Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization [61.641777037967366]
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns.<n>Agentic reinforcement learning (RL) has emerged as a promising solution for training such agents in multi-turn settings.<n>We propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities.
arXiv Detail & Related papers (2026-02-11T20:40:43Z)
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents [28.145430029174577]
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments.<n>Existing approaches typically rely on outcome-based rewards that are only provided at the final answer.<n>In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
arXiv Detail & Related papers (2025-10-16T17:59:32Z)
Knowledge Base-Aware Orchestration: A Dynamic, Privacy-Preserving Method for Multi-Agent Systems [39.146761527401424]
We introduce Knowledge Base-Aware (KBA) Orchestration, a novel approach that augments static descriptions with dynamic, privacy-preserving relevance signals.<n>By combining this novel mechanism with static descriptions, our method achieves more accurate and adaptive task routing.<n> Benchmarks show that our KBA Orchestration significantly outperforms static description-driven methods in routing precision and overall system efficiency.
arXiv Detail & Related papers (2025-09-23T21:46:38Z)
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization [58.65395773049273]
Location Preference Optimization (LPO) is a novel approach that leverages locational data to optimize interaction preferences.<n>LPO uses information entropy to predict interaction positions by focusing on zones rich in information.<n>Our code will be made publicly available soon, at https://github.com/AIDC-AI/LPO.
arXiv Detail & Related papers (2025-06-11T03:43:30Z)
Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents [110.25679611755962]
Current language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions. We introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users' implicit intentions through explicit queries. We empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires user intentions, and refines them into actionable goals.
arXiv Detail & Related papers (2024-02-14T14:36:30Z)
MASP: Scalable GNN-based Planning for Multi-Agent Navigation [18.70078556851899]
Multi-Agent Scalable Graph-based Planner (MASP) is a goal-conditioned hierarchical planner for navigation tasks.<n>MASP employs a hierarchical framework to reduce space complexity by decomposing a large exploration space into multiple goal-conditioned subspaces.<n>For agent cooperation and the adaptation to varying team sizes, we model agents and goals as graphs to better capture their relationship.
arXiv Detail & Related papers (2023-12-05T06:05:04Z)
Asynchronous Message-Passing and Zeroth-Order Optimization Based Distributed Learning with a Use-Case in Resource Allocation in Communication Networks [11.182443036683225]
Distributed learning and adaptation have received significant interest and found wide-ranging applications in machine learning signal processing.<n>This paper specifically focuses on a scenario where agents collaborate towards a common task.<n>Agents, acting as transmitters, collaboratively train their individual policies to maximize a global reward.
arXiv Detail & Related papers (2023-11-08T11:12:27Z)
Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning [46.28771270378047]
Federated reinforcement learning (RL) enables collaborative decision making of multiple distributed agents without sharing local data trajectories. In this work, we consider a multi-task setting, in which each agent has its own private reward function corresponding to different tasks, while sharing the same transition kernel of the environment. We learn a globally optimal policy that maximizes the sum of the discounted total rewards of all the agents in a decentralized manner.
arXiv Detail & Related papers (2023-11-01T00:15:18Z)
Research on Multi-Agent Communication and Collaborative Decision-Making Based on Deep Reinforcement Learning [0.0]
This thesis studies the cooperative decision-making of multi-agent based on the Multi-Agent Proximal Policy Optimization algorithm. Different agents can alleviate the non-stationarity caused by local observations through information exchange between agents. The experimental results show that the improvement has achieved certain effects, which can better alleviate the non-stationarity of the multi-agent environment.
arXiv Detail & Related papers (2023-05-23T14:20:14Z)
Distributed Adaptive Learning Under Communication Constraints [54.22472738551687]
This work examines adaptive distributed learning strategies designed to operate under communication constraints. We consider a network of agents that must solve an online optimization problem from continual observation of streaming data.
arXiv Detail & Related papers (2021-12-03T19:23:48Z)
Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as MAML or Dif-MAML. We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML. Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.