Related papers: Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

URL: http://arxiv.org/abs/2510.02837v1
Date: Fri, 03 Oct 2025 09:19:15 GMT
Title: Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
Authors: Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park
Abstract summary: A proper evaluation of an agent's performance must go beyond the final answer to also assess the problem-solving trajectory. We introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner.
Score: 22.781523439717223
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent's performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward method for evaluating these aspects is to compare an agent's trajectory with the ground-truth trajectory, but this approach is fundamentally limited since annotating all valid ground-truth trajectories is prohibitively expensive. However, a simple LLM-based evaluator struggles to assess trajectories in detail without ground truth. To effectively evaluate the agents in this manner, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

Related papers

TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents [51.30998248590416]
Trajectory-Aware Comprehensive Evaluation (TRACE) is a framework that holistically assesses the entire problem-solving trajectory. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity.
arXiv Detail & Related papers (2026-02-05T13:28:57Z)
Verified Critical Step Optimization for LLM Agents [67.05296684575445]
Critical Step Optimization focuses preference learning on verified critical steps. Method starts from failed policy trajectories rather than expert demonstrations. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline.
arXiv Detail & Related papers (2026-02-03T11:41:02Z)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks. We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis [5.935239028627343]
We introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable.
arXiv Detail & Related papers (2025-10-18T20:21:09Z)
JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer [19.09571232466437]
We propose Agent-as-Interviewer, a dynamic evaluation paradigm for large language models (LLMs). Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to invoke knowledge tools for wider and deeper knowledge in the dynamic multi-turn question generation. We develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and uses difficulty scoring as strategy guidance.
arXiv Detail & Related papers (2025-09-02T08:52:16Z)
MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning [33.009759731505746]
Complex tasks involving tool integration pose significant challenges for Large Language Models. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic benchmarks. We propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory.
arXiv Detail & Related papers (2025-05-27T03:37:33Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.