TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents
- URL: http://arxiv.org/abs/2602.21230v1
- Date: Thu, 05 Feb 2026 13:28:57 GMT
- Title: TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents
- Authors: Yanyu Chen, Jiyue Jiang, Jiahong Liu, Yifei Zhang, Xiao Guo, Irwin King,
- Abstract summary: Trajectory-Aware Comprehensive Evaluation (TRACE) is a framework that holistically assesses the entire problem-solving trajectory. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity.
- Score: 51.30998248590416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evaluation of Deep Research Agents is a critical challenge, as conventional outcome-based metrics fail to capture the nuances of their complex reasoning. Current evaluation faces two primary challenges: 1) a reliance on singular metrics like Pass@1, creating a "high-score illusion" that ignores the quality, efficiency, and soundness of the reasoning process; and 2) the failure of static benchmarks to quantify crucial attributes like robustness and latent capability. To address these gaps, we introduce TRACE (Trajectory-Aware Comprehensive Evaluation), a framework that holistically assesses the entire problem-solving trajectory. To counter the "high-score illusion", we propose a Hierarchical Trajectory Utility Function that quantifies process efficiency and cognitive quality, including evidence grounding, alongside accuracy. To measure deeper attributes, TRACE introduces a Scaffolded Capability Assessment protocol, quantifying an agent's latent ability by determining the minimum guidance needed for success. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity. Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.
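The abstract combines accuracy with process efficiency and evidence grounding into a single trajectory-level score. The paper's exact Hierarchical Trajectory Utility Function is not reproduced here; the sketch below is a hypothetical weighted blend (field names, weights, and normalizations are all assumptions) that illustrates how such a utility can expose agents whose Pass@1 hides an inefficient or poorly grounded process.

```python
# Illustrative sketch only: weights and fields are hypothetical, not
# the paper's actual Hierarchical Trajectory Utility Function.
from dataclasses import dataclass

@dataclass
class Trajectory:
    correct: bool          # final-answer accuracy (the Pass@1 signal)
    steps_taken: int       # reasoning/tool steps the agent actually used
    optimal_steps: int     # shortest known solution path
    grounded_claims: int   # claims backed by retrieved evidence
    total_claims: int      # all claims made along the trajectory

def trajectory_utility(t: Trajectory,
                       w_acc: float = 0.5,
                       w_eff: float = 0.25,
                       w_grd: float = 0.25) -> float:
    """Blend accuracy with process efficiency and evidence grounding."""
    accuracy = 1.0 if t.correct else 0.0
    efficiency = min(t.optimal_steps / t.steps_taken, 1.0) if t.steps_taken else 0.0
    grounding = t.grounded_claims / t.total_claims if t.total_claims else 0.0
    return w_acc * accuracy + w_eff * efficiency + w_grd * grounding
```

Under this toy scoring, two agents with identical Pass@1 can rank differently once wasted steps and unsupported claims are penalized, which is the "high-score illusion" the abstract describes.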
Related papers
- DREAM: Deep Research Evaluation with Agentic Metrics [21.555357444628044]
We propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that makes evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks.
arXiv Detail & Related papers (2026-02-21T19:14:31Z)
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
- Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents [22.781523439717223]
A proper evaluation of an agent's performance must go beyond the final answer to also assess the problem-solving trajectory. We introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner.
arXiv Detail & Related papers (2025-10-03T09:19:15Z)
- CORE: Full-Path Evaluation of LLM Agents Beyond Final State [2.0391237204597368]
Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state. We propose a framework based on deterministic finite automata that encodes tasks as sets of valid tool-use paths. We introduce CORE, a suite of five metrics: Path Correctness, Path Correctness - Kendall's tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency.
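The core idea of encoding a task as a deterministic finite automaton over tool calls can be sketched in a few lines. The states, tool names, and transitions below are invented for illustration (the paper's actual task encodings are not reproduced here); a trajectory is path-correct only if the DFA accepts its full tool-call sequence, so both the final state and every intermediate call are checked.

```python
# Hypothetical DFA for one toy task: every (state, tool) pair maps to a
# next state; any unlisted pair is an invalid call from that state.
VALID = {
    ("start", "search"): "found",
    ("found", "read"): "read",
    ("read", "answer"): "done",
}
ACCEPTING = {"done"}

def path_correct(calls: list[str]) -> bool:
    """Accept a trajectory iff the whole tool-call path is valid."""
    state = "start"
    for call in calls:
        nxt = VALID.get((state, call))
        if nxt is None:
            return False  # invalid tool call from this state
        state = nxt
    return state in ACCEPTING
```

Note how this differs from final-state judging: a trajectory that reaches a correct answer through a disallowed call sequence is still rejected, which is what full-path metrics like Prefix Criticality and Harmful-Call Rate are built on top of.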
arXiv Detail & Related papers (2025-09-25T10:49:35Z)
- Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference [8.823529310904162]
Multi-agent systems (MAS) are critical for automating complex tasks, yet their practical deployment is hampered by the challenge of failure attribution. We introduce the first failure attribution framework for MAS grounded in multi-granularity causal inference.
arXiv Detail & Related papers (2025-09-10T15:22:00Z)
- Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects [0.6087817758152709]
We present a systematic study of personality control using the Big Five traits. Trait-level analysis shows openness as uniquely challenging and agreeableness as most resistant to ICL. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs.
arXiv Detail & Related papers (2025-09-05T04:19:15Z)
- Towards Evaluating Fake Reasoning Bias in Language Models [47.482898076525494]
We show that models favor the surface structure of reasoning even when the logic is flawed. We introduce THEATER, a benchmark that systematically investigates Fake Reasoning Bias (FRB). We evaluate 17 advanced Large Reasoning Models (LRMs) on both subjective DPO and factual datasets.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
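SEEM is described as an eigenvalue-based measure of how the Q-network evolves during training. The paper's exact construction is not given here; the sketch below only shows the underlying linear-algebra primitive such a measure relies on, a power iteration estimating a matrix's dominant eigenvalue (the matrix itself, e.g. a kernel derived from the Q-network, is assumed and not reproduced).

```python
# Generic power-iteration sketch: estimates the dominant eigenvalue of
# a square matrix A (given as a list of row lists). This is only the
# primitive behind an eigenvalue measure like SEEM, not SEEM itself.
def dominant_eigenvalue(A, iters=200):
    n = len(A)
    v = [1.0] * n                      # arbitrary nonzero start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)   # infinity-norm estimate
        if lam == 0.0:
            break                      # v landed in the null space
        v = [x / lam for x in w]       # renormalize for stability
    return lam
```

A growing dominant eigenvalue of such a training-dynamics matrix is the kind of signal that can flag divergence early, which matches the summary's claim of deciding divergence at an early stage.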
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection [75.80075054706079]
We propose a weakly- and semi-supervised object detection framework (WSSOD).
An agent detector is first trained on a joint dataset and then used to predict pseudo bounding boxes on weakly-annotated images.
The proposed framework demonstrates remarkable performance on the PASCAL-VOC and MSCOCO benchmarks, achieving performance comparable to that obtained in fully-supervised settings.
arXiv Detail & Related papers (2021-05-21T11:58:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.