Fugu-MT Paper Translation (Abstract): LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Paper abstract: LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

arxiv url: http://arxiv.org/abs/2511.13998v1
Date: Mon, 17 Nov 2025 23:57:24 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:52.842442
Title: LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
Title (reference translation): LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
Authors: Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang
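Abstract summary: LoCoBench-Agent is a comprehensive evaluation framework for assessing large language model (LLM) agents in realistic, long-context software engineering. The framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. It provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.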
Reference score (attention score computed by this site): 90.84806758077536
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like LoCoBench [qiu2025locobench] assess long-context code understanding, they focus on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning required by real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics across comprehension and efficiency dimensions. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens, enabling precise assessment of long-context performance. Through systematic evaluation of state-of-the-art models, we reveal several key findings: (1) agents exhibit remarkable long-context robustness; (2) a comprehension-efficiency trade-off exists, with the two negatively correlated: thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns differentiating high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.
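Abstract (reference translation): As large language models (LLMs) evolve into advanced autonomous agents that can carry out complex software development tasks, it becomes important to evaluate their real-world capabilities. Existing benchmarks such as LoCoBench [qiu2025locobench] evaluate long-context code understanding, but they concentrate on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning that real coding agents need. This paper introduces LoCoBench-Agent, a comprehensive evaluation framework for assessing LLM agents in realistic, long-context software engineering workflows. The framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. It also introduces an evaluation methodology with 9 metrics across the comprehension and efficiency dimensions. The framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths from 10K to 1M tokens, enabling precise assessment of long-context performance. A systematic evaluation of state-of-the-art models yields three findings: (1) agents show remarkable long-context robustness; (2) comprehension and efficiency are negatively correlated, since thorough exploration raises comprehension but lowers efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns distinguishing high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.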


Related papers