Related papers: Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

URL: http://arxiv.org/abs/2602.19225v1
Date: Sun, 22 Feb 2026 15:18:03 GMT
Title: Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training
Authors: Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, Peilin Zhao,
Abstract summary: Multi-turn LLM agents are pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management.<n>We propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real-world deployment.
Score: 26.571744733431448
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high-difficulty task signifies a genuine capability breakthrough. Yet, existing group-based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real-world deployment. ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation dynamically adapts gradient intensity based on episode-level difficulty, while proximity-based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug-and-play compatibility with standard GRPO frameworks, facilitating immediate, low-friction adoption in existing industrial training pipelines. Our implementation is available at: \href{https://anonymous.4open.science/r/proxmo-B7E7/README.md}{https://anonymous.4open.science/r/proxmo}.

Related papers

Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes.<n>In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID.<n>We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z)
Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
textttLearn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents.<n>Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z)
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time [12.659582318581606]
Current approaches, including Vision Foundation Models (VFMs) and Multimodal Large Language Models (MLLMs), enhance semantic understanding but suffer from high computational overhead and limited cross-domain transferability when fine-tuned.<n>We propose a novel framework employing Test-Time Scaling (TTS) that leverages vision-language alignment capabilities through Guidance-based methods for direct similarity scoring.<n>Our approach eliminates two-stage processing by employing structured prompts that generate length-controllable scoring outputs.
arXiv Detail & Related papers (2025-09-02T09:25:13Z)
InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling [71.37579508777843]
Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities.<n>To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments.
arXiv Detail & Related papers (2025-08-12T05:00:00Z)
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z)
A Production Scheduling Framework for Reinforcement Learning Under Real-World Constraints [0.0]
Real-world production environments introduce additional complexities that cause traditional scheduling approaches to be less effective.<n>Reinforcement learning (RL) holds potential in addressing these challenges, as it allows agents to learn adaptive scheduling strategies.<n>We propose a modular framework that extends classical JSSP formulations by incorporating key real-world constraints.<n>JobShopLab is an open-source tool for both research and industrial applications.
arXiv Detail & Related papers (2025-06-16T14:50:26Z)
Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems [25.882461853973897]
We propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which guides policy updates by estimating relative reward advantages.<n>MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead.<n>We also introduce three group rollout sampling strategies that trade off between efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-03T10:17:19Z)
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [76.42361936804313]
We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design.<n> MAS-ZERO employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance.
arXiv Detail & Related papers (2025-05-21T00:56:09Z)
Performant LLM Agentic Framework for Conversational AI [1.6114012813668932]
We introduce the Performant Agentic Framework (PAF), a novel system that assists Large Language Models (LLMs) in selecting appropriate nodes and executing actions in order when traversing complex graphs.<n>PAF combines LLM-based reasoning with a mathematically grounded vector scoring mechanism, achieving both higher accuracy and reduced latency.<n>Experiments demonstrate that PAF significantly outperforms baseline methods, paving the way for scalable, real-time Conversational AI systems in complex business environments.
arXiv Detail & Related papers (2025-03-09T02:58:34Z)
Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models [14.68920095399595]
sparsity-based PEFT (SPEFT) introduces trainable sparse adaptations to the weight matrices in the model.<n>We conduct the first systematic evaluation of salience metrics for SPEFT, inspired by zero-cost NAS proxies.<n>We compare static and dynamic masking strategies, finding that static masking, which predetermines non-zero entries before training, delivers efficiency without sacrificing performance.
arXiv Detail & Related papers (2024-12-18T04:14:35Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.