Fugu-MT Paper Translation (Summary): TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Paper summary: TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

arxiv url: http://arxiv.org/abs/2602.02196v2
Date: Tue, 03 Feb 2026 04:28:15 GMT
Status: Translation complete
Last updated in system: 2026-02-04 13:28:03.742391
Title: TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents
Title (reference translation): TIDE: Trajectory-Based Diagnostic Evaluation of Test-Time Improvement in LLM Agents
Authors: Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang, Tao Qin, Jun Liu, Qika Lin
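Abstract summary: Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. This paper proposes Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes Test-Time Improvement (TTI) into three comprehensive and interconnected dimensions.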
Reference score (independently computed attention score): 43.376952807616256
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing evaluation metrics fail to capture agents' task optimization efficiency, their behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework (1) measures the overall temporal dynamics of task completion and identifies whether performance is primarily constrained (2) by recursive looping behaviors or (3) by burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning, calling for explicitly optimizing the interaction dynamics between the agent and the environment.
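Abstract (reference translation): Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms by which TTI succeeds or fails are still not well understood, and existing evaluation metrics cannot capture task optimization efficiency, behavior adaptation after erroneous actions, or the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework measures (1) the temporal dynamics of task completion, (2) performance constraints caused by recursive looping behaviors, and (3) performance constraints caused by accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning, and calls for explicitly optimizing the interaction dynamics between the agent and the environment.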

Related papers