Fugu-MT Paper Translation (Abstract): AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Paper abstract: AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

arxiv url: http://arxiv.org/abs/2605.12925v1
Date: Wed, 13 May 2026 03:00:57 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.77493
Title: AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
Authors: Priyam Sahoo, Gaurav Mittal, Xiaomin Li, Shengjie Ma, Benjamin Steenhoek, Pingping Lin, Yu Hu,
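Abstract summary: The authors evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Among passing trajectories in the 1,815-trajectory evaluation subset, 10.7% exhibit behavior the authors call a Lucky Pass. The paper introduces AgentLens, a framework for process-level assessment of SWE-agent trajectories.
Reference score (attention measure computed independently by this site): 11.272830796781925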
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the AgentLens-Bench dataset and AgentLens SDK, at https://github.com/microsoft/code-agent-state-trajectories/.
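The abstract says that PTA references are built by merging multiple passing solutions for the same task. The following is a minimal sketch of that general idea, not the AgentLens implementation: it assumes each trajectory has already been reduced to a sequence of intent labels, and the names PTANode, build_pta, and first_divergence, as well as the example trajectories, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PTANode:
    """One state of the prefix tree; children are keyed by intent label."""
    children: dict = field(default_factory=dict)
    accepting: bool = False  # True if some passing trajectory ends here
    visits: int = 0          # number of reference trajectories through this state


def build_pta(sequences):
    """Merge label sequences into one prefix tree (shared prefixes share states)."""
    root = PTANode()
    for seq in sequences:
        node = root
        node.visits += 1
        for label in seq:
            node = node.children.setdefault(label, PTANode())
            node.visits += 1
        node.accepting = True
    return root


def first_divergence(root, seq):
    """Return the index of the first action that leaves the tree, or None."""
    node = root
    for i, label in enumerate(seq):
        if label not in node.children:
            return i
        node = node.children[label]
    return None


# Hypothetical passing trajectories for one task, already reduced to intent labels.
passing = [
    ["Exploration", "Implementation", "Verification"],
    ["Exploration", "Exploration", "Implementation", "Verification"],
]
pta = build_pta(passing)

# A trajectory that implements before exploring leaves the tree at its first action.
candidate = ["Implementation", "Exploration", "Verification"]
print(first_divergence(pta, candidate))
```

In this sketch a divergence point is simply the first action with no matching transition in the reference tree. How AgentLens actually scores divergence, waste signals, or quality tiers is not specified in the abstract.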
