Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning
- URL: http://arxiv.org/abs/2601.23032v1
- Date: Fri, 30 Jan 2026 14:42:04 GMT
- Title: Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning
- Authors: Siyu Gong, Linan Yue, Weibo Gao, Fangzhou Yao, Shimin Di, Lei Feng, Min-Ling Zhang
- Abstract summary: AutoTraj is a framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches depend on high-quality synthesized trajectories selected by scoring functions and on sparse outcome-based rewards, providing limited and biased supervision for learning TIR. To address these challenges, in this paper we propose AutoTraj, a two-stage framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Specifically, in the supervised fine-tuning (SFT) stage, AutoTraj generates multiple candidate tool-use trajectories for each query and evaluates them along multiple dimensions. High-quality trajectories are retained directly, while low-quality ones are repaired using an LLM (i.e., LLM-as-Repairer). The resulting repaired and high-quality trajectories form a synthetic SFT dataset, while each repaired trajectory, paired with its original low-quality counterpart, constitutes a dataset for trajectory preference modeling. In the reinforcement learning (RL) stage, based on the preference dataset, we train a trajectory-level reward model to assess the quality of reasoning paths and combine it with outcome and format rewards, thereby explicitly guiding optimization toward reliable TIR behaviors. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj in TIR.
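To make the two stages concrete, here is a minimal Python sketch of the pipeline the abstract describes. The `generate`, `evaluate`, `repair`, and `reward_model` callables, the candidate count `k`, the quality threshold, and the reward weights are all hypothetical placeholders for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    query: str
    steps: list         # interleaved reasoning steps and tool calls
    score: float = 0.0  # aggregated multi-dimensional quality score

def build_sft_and_preference_data(queries, generate, evaluate, repair,
                                  k=4, threshold=0.8):
    """Stage 1 (SFT): for each query, sample k candidate tool-use
    trajectories, evaluate them along multiple dimensions, retain
    high-quality ones, and repair low-quality ones with the
    LLM-as-Repairer. Each (repaired, original) pair doubles as a
    (chosen, rejected) example for trajectory preference modeling."""
    sft_data, preference_pairs = [], []
    for query in queries:
        for traj in generate(query, k=k):      # candidate trajectories
            traj.score = evaluate(traj)        # multi-dimensional evaluation
            if traj.score >= threshold:
                sft_data.append(traj)          # high quality: keep as-is
            else:
                repaired = repair(traj)        # LLM-as-Repairer
                sft_data.append(repaired)
                preference_pairs.append((repaired, traj))  # (chosen, rejected)
    return sft_data, preference_pairs

def combined_reward(traj, answer_correct, format_ok, reward_model,
                    w_outcome=1.0, w_format=0.2, w_traj=0.5):
    """Stage 2 (RL): augment the sparse outcome reward and format reward
    with a trajectory-level score from the reward model trained on the
    preference pairs. The weights here are illustrative assumptions."""
    r_outcome = 1.0 if answer_correct else 0.0
    r_format = 1.0 if format_ok else 0.0
    return w_outcome * r_outcome + w_format * r_format + w_traj * reward_model(traj)
```

The key design point suggested by the abstract is that the preference data falls out of the repair step for free: every repair produces a natural chosen/rejected pair, which is then used to train the trajectory-level reward model that densifies the otherwise sparse outcome signal.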
Related papers
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically relies on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z) - Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control [82.30868101940068]
We propose a paradigm in which a model bootstraps its own performance without reliance on external data or teacher models. Our theoretical analysis shows that RSIR acts as a data-driven implicit regularizer, smoothing the optimization landscape. We show that even smaller models benefit, and weak models can generate effective training curricula for stronger ones.
arXiv Detail & Related papers (2026-02-17T15:31:32Z) - TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use [74.47746287181383]
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool-use capability.
arXiv Detail & Related papers (2025-10-06T07:30:25Z) - Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning [68.89572566071575]
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. We propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light.
arXiv Detail & Related papers (2025-09-27T12:53:37Z) - Large Language Model-Empowered Decision Transformer for UAV-Enabled Data Collection [71.84636717632206]
The use of unmanned aerial vehicles (UAVs) for reliable and energy-efficient data collection from spatially distributed devices holds great promise in supporting Internet of Things (IoT) applications. We propose an LLM-empowered decision transformer (LLM-CRDT) to learn effective UAV control policies. LLM-CRDT outperforms benchmark online and offline methods, achieving up to 36.7% higher energy efficiency than current state-of-the-art DT approaches.
arXiv Detail & Related papers (2025-09-17T13:05:08Z) - TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation [27.71948796412585]
We propose the Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference datasets.
arXiv Detail & Related papers (2025-05-26T14:06:02Z) - Are You Being Tracked? Discover the Power of Zero-Shot Trajectory Tracing with LLMs! [3.844253028598048]
This study introduces LLMTrack, a model that demonstrates how LLMs can be leveraged for zero-shot trajectory recognition.
We evaluate the model on real-world datasets featuring distinct indoor and outdoor trajectory scenarios designed to challenge it.
arXiv Detail & Related papers (2024-03-10T12:50:35Z) - Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents [41.14201835950814]
Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines.
Previous work first collected interaction trajectories between LLMs and environments, then fine-tuned smaller models using only the trajectories that successfully finished the task.
We argue that unsuccessful trajectories offer valuable insights, and LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies.
arXiv Detail & Related papers (2024-02-18T17:10:07Z) - Leveraging Optimal Transport for Enhanced Offline Reinforcement Learning in Surgical Robotic Environments [4.2569494803130565]
We introduce an algorithm that assigns rewards to offline trajectories using a small number of high-quality expert demonstrations (a minimal sketch follows this list).
This approach circumvents the need for handcrafted rewards, unlocking the potential to harness vast datasets for policy learning.
arXiv Detail & Related papers (2023-10-13T03:39:15Z)
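As referenced in the optimal-transport entry above, here is a small illustration of that idea: score an offline trajectory by the negative exact Wasserstein cost between its state-action features and an expert demonstration, using the POT library (pip install pot). The feature encoding, uniform step weights, and negative-cost reward are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def ot_reward(trajectory: np.ndarray, expert: np.ndarray) -> float:
    """trajectory, expert: (T, d) arrays of state-action features.
    Returns the negative exact Wasserstein cost between the two empirical
    distributions, treating each time step as an equally weighted sample."""
    M = ot.dist(trajectory, expert)   # pairwise squared-Euclidean cost matrix
    a = ot.unif(trajectory.shape[0])  # uniform weights over trajectory steps
    b = ot.unif(expert.shape[0])      # uniform weights over expert steps
    return -ot.emd2(a, b, M)          # exact OT cost via linear programming

# Usage: a trajectory closer to expert behavior receives a higher
# (less negative) reward, with no handcrafted reward function needed.
rng = np.random.default_rng(0)
expert_demo = rng.normal(size=(50, 8))
offline_traj = expert_demo + rng.normal(scale=0.1, size=(50, 8))
print(ot_reward(offline_traj, expert_demo))
```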