Fugu-MT Paper Translation (Abstract): Counterfactual Trace Auditing of LLM Agent Skills

Paper overview: Counterfactual Trace Auditing of LLM Agent Skills

arxiv url: http://arxiv.org/abs/2605.11946v1
Date: Tue, 12 May 2026 10:56:18 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.808668
Title: Counterfactual Trace Auditing of LLM Agent Skills
Authors: Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi, Xiyang Hu,
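Abstract summary: We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect.
Reference score (attention score computed independently by this site): 38.396785087675774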
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black-box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with-skill agent trace with a without-skill counterpart on the same task, segments both traces into goal-directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off-task artifact creation, excess planning, and task recovery. Three findings emerge. First, high-baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure-mode observations into reproducible behavioral measurements.
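The pair, segment, align, and annotate steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give trace formats, segmentation rules, or the SIP schema, so every name here (`Step`, `Phase`, `segment`, `align`, `audit`, the goal labels) is a hypothetical stand-in. It assumes each trace step already carries a goal label, and it swaps in the standard library's `difflib.SequenceMatcher` for whatever phase alignment the paper actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


# Hypothetical trace format: the abstract does not specify one.
@dataclass
class Step:
    goal: str    # assumed phase label, e.g. "explore", "edit", "test"
    action: str  # free-text description of what the agent did


@dataclass
class Phase:
    goal: str
    steps: list


def segment(trace):
    """Group consecutive steps that share a goal label into phases."""
    phases = []
    for step in trace:
        if phases and phases[-1].goal == step.goal:
            phases[-1].steps.append(step)
        else:
            phases.append(Phase(step.goal, [step]))
    return phases


def align(with_skill, without_skill):
    """Align two phase sequences by goal label.

    Returns difflib opcodes (tag, i1, i2, j1, j2), where i indexes the
    without-skill phases and j indexes the with-skill phases.
    """
    a = [p.goal for p in without_skill]
    b = [p.goal for p in with_skill]
    return SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()


def audit(with_trace, without_trace):
    """Emit one annotation per aligned region where the paired traces differ.

    The dictionaries are a stand-in for SIP annotations, not the paper's schema.
    """
    w, wo = segment(with_trace), segment(without_trace)
    annotations = []
    for tag, i1, i2, j1, j2 in align(w, wo):
        if tag == "equal":
            continue
        annotations.append({
            "change": tag,  # "insert", "delete", or "replace"
            "without_skill": [p.goal for p in wo[i1:i2]],
            "with_skill": [p.goal for p in w[j1:j2]],
        })
    return annotations


if __name__ == "__main__":
    without = [Step("explore", "ls"), Step("explore", "cat"),
               Step("edit", "patch"), Step("test", "pytest")]
    with_ = [Step("plan", "outline"), Step("explore", "ls"),
             Step("edit", "patch"), Step("test", "pytest")]
    print(audit(with_, without))
```

In this toy pair both runs end in the same `test` phase, so an outcome-only metric would treat them as equivalent, while the audit still records that the with-skill run inserted a `plan` phase. That is the kind of behavioral difference the abstract says pass rate cannot detect.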
