Fugu-MT Paper Translation (Abstract): Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

Paper overview: Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

arxiv url: http://arxiv.org/abs/2604.00594v1
Date: Wed, 01 Apr 2026 07:59:59 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.894179
Title: Agent psychometrics: Task-level performance prediction in agentic coding benchmarks
Title (reference translation): Agent psychometrics: Task-level performance prediction in agentic coding benchmarks
Authors: Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan,
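Abstract summary: We propose a framework for predicting success or failure on individual tasks, tailored to the agentic coding regime. Our approach augments Item Response Theory (IRT) with rich features extracted from tasks, including issue statements, repository contexts, solutions, and test cases.
Reference score (independently computed attention score): 24.348135523715815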
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is compounded by current practice: agent performance is typically measured by aggregate pass rates on benchmarks, but single-number metrics obscure the diversity of tasks within a benchmark. We present a framework for predicting success or failure on individual tasks tailored to the agentic coding regime. Our approach augments Item Response Theory (IRT) with rich features extracted from tasks, including issue statements, repository contexts, solutions, and test cases, and introduces a novel decomposition of agent ability into LLM and scaffold ability components. This parameterization enables us to aggregate evaluation data across heterogeneous leaderboards and accurately predict task-level performance for unseen benchmarks, as well as unseen LLM-scaffold combinations. Our methods have practical utility for benchmark designers, who can better calibrate the difficulty of their new tasks without running computationally expensive agent evaluations.
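Abstract (reference translation): LLM-based coding is shifting from static single-step code generation to multi-step agentic interaction with tools and environments. Agent performance is typically measured by aggregate pass rates on benchmarks, but single-number metrics obscure the diversity of tasks within a benchmark. We propose a framework for predicting success or failure on individual tasks, tailored to the agentic coding regime. Our approach augments Item Response Theory (IRT) with rich features extracted from tasks, such as issue statements, repository contexts, solutions, and test cases, and introduces a new decomposition of agent ability into LLM and scaffold ability components. This parameterization makes it possible to aggregate evaluation data across heterogeneous leaderboards and to accurately predict task-level performance on unseen benchmarks and for unseen LLM-scaffold combinations. The method is of practical use to benchmark designers, who can better calibrate the difficulty of new tasks without running computationally expensive agent evaluations.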
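
The abstract does not state the model's equations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a standard one-parameter logistic (Rasch) IRT model, an additive split of agent ability into an LLM term and a scaffold term, and a linear map from task features to difficulty. The function names, the additive form, the linear difficulty map, and all numeric values are hypothetical.

```python
import math


def task_difficulty_from_features(features, weights, bias=0.0):
    """Hypothetical linear map from numeric task features to a difficulty score.

    Assumption: the abstract only says difficulty is informed by features of the
    issue statement, repository context, solution, and tests; the linear form
    used here is illustrative.
    """
    return bias + sum(w * x for w, x in zip(weights, features))


def predict_success(llm_ability, scaffold_ability, task_difficulty):
    """Rasch-style (1PL) probability that an agent solves a task.

    Assumption: agent ability is the sum of an LLM term and a scaffold term.
    The abstract says ability is decomposed into these two components but does
    not specify how they are combined.
    """
    agent_ability = llm_ability + scaffold_ability
    return 1.0 / (1.0 + math.exp(-(agent_ability - task_difficulty)))


if __name__ == "__main__":
    # Made-up feature values and weights, for illustration only.
    difficulty = task_difficulty_from_features(
        features=[1.0, 0.5], weights=[0.8, 0.4]
    )
    # An LLM/scaffold pairing that was never evaluated together can still be
    # scored, because each component carries its own ability parameter.
    p = predict_success(
        llm_ability=0.9, scaffold_ability=0.3, task_difficulty=difficulty
    )
    print(f"difficulty={difficulty:.2f}, predicted success={p:.3f}")
```

Under these assumptions, separate ability parameters per LLM and per scaffold are what would allow results from different leaderboards to share parameters and allow predictions for pairings that were never run together.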

List of related papers