Fugu-MT Paper Translation (Abstract): Agent trajectories as programs: fingerprinting and programming coding-agent behavior

Paper summary: Agent trajectories as programs: fingerprinting and programming coding-agent behavior

arxiv url: http://arxiv.org/abs/2606.16988v1
Date: Mon, 15 Jun 2026 17:28:41 GMT
Status: Translation complete
Last updated in system: 2026-06-16 18:36:05.100559
Title: Agent trajectories as programs: fingerprinting and programming coding-agent behavior
Authors: Hamidah Oderinwale
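Abstract summary: Benchmark scores tell you what an agent got right, not how it got there. This work introduces methods for comparing agents procedurally across contexts where the model, tasks, and approaches vary. A probe over these procedural signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy, controlling for leakage across tasks.
Reference score (independently computed attention score): 0.49316866264940024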
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally in different contexts, where the model, tasks, and approaches vary. We compare ten agents and find that they are identifiable by their behavioral habits, which we define as fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy, controlling for leakage across tasks. We develop procedural representations for agent problem-solving procedures with an emergent vocabulary induction technique that is meant to be maximally compressive to avoid surface-level variation while being expressive enough to unveil the quirks of the models' patterns. We apply our framework to the software engineering evaluation dataset SWE-Bench to study the structural distinctness of agent trajectories and find that behavior is most similar between models from similar release periods and those that are distilled from one another (e.g., a distilled student model and its teacher have a Jensen-Shannon divergence of 0.25, about half the distance between other model pairs). As more models saturate evaluations, we believe that it will be important to probe model behavior along more holistic dimensions than success rates alone. We introduce ProcGrep, a library for auditing and evaluating agents for how they approach tasks at a procedural level given their traces in a top-down fashion. We believe this work has a range of applications to help developers work with and program coding agents, such as task-aware model routing, agent monitoring, and finer-grained cost analysis.
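The abstract reports a Jensen-Shannon divergence between agents but does not say which distributions are compared or which logarithm base is used. The following is a minimal sketch of the general idea, assuming each trajectory is reduced to a sequence of action tokens and that agents are compared by their token frequency distributions. The vocabulary, the example trajectories, and all function names are hypothetical and are not taken from the paper or from ProcGrep.

```python
import math
from collections import Counter


def action_distribution(trajectory, vocabulary):
    """Return the relative frequency of each vocabulary token in a trajectory."""
    counts = Counter(trajectory)
    total = sum(counts[token] for token in vocabulary)
    if total == 0:
        raise ValueError("trajectory contains no tokens from the vocabulary")
    return [counts[token] / total for token in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    # The mixture m is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical action vocabulary and trajectories (assumptions for illustration only).
VOCABULARY = ["search", "read", "edit", "test"]
agent_a = ["search", "read", "read", "edit", "test", "edit", "test"]
agent_b = ["read", "edit", "edit", "edit", "test"]

p = action_distribution(agent_a, VOCABULARY)
q = action_distribution(agent_b, VOCABULARY)
divergence = jensen_shannon_divergence(p, q)
print(f"JSD(agent_a, agent_b) = {divergence:.3f} bits")
```

With base-2 logarithms the divergence is 0 for identical distributions and 1 for distributions with no shared tokens, so a value such as 0.25 can be read on that scale only if the paper uses the same base, which the abstract does not confirm.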

Related papers