Fugu-MT paper translation (abstract): Dissecting model behavior through agent trajectories

Paper abstract: Dissecting model behavior through agent trajectories

arxiv url: http://arxiv.org/abs/2606.17454v2
Date: Wed, 17 Jun 2026 04:51:06 GMT
Status: Translation complete
Last updated in system: 2026-06-18 13:57:35.216419
Title: Dissecting model behavior through agent trajectories
Title (reference translation): Dissecting model behavior through agent trajectories
Authors: Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras
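Abstract summary: We formalize the intent-execution gap: the mismatch between what the model intends and what the harness executes, and vice versa. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called Simple Strands Agent (SSA). We (i) reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks, and (ii) build on an analysis of 138k trajectories generated by SSA.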
Reference score (independently computed attention score): 15.811597127775812
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
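Abstract (reference translation): The performance of an AI agent is not merely a modeling problem; at its core it is a systems problem. A model's advanced capabilities are realized through the agent harness. A gap between what the model assumes and how the harness behaves can therefore easily keep the model's full capabilities from carrying over into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap matters as much as other aspects of harness design, such as tools and execution loops. To illustrate the effect of this harness-model alignment, we develop a simple, customizable harness called 'Simple Strands Agent' (SSA). SSA aims to identify the bulk of the common patterns that generalize across model families (such as Claude, Gemini, GPT, Grok, Qwen), together with a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond pass@1 numbers, which tend to be relatively even across frontier models. Representing agent trajectories in code state-spaces lets us observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase transitions reveal how individual models allocate effort across the stages of autonomous problem solving.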

Related papers