Fugu-MT Paper Translation (Abstract): LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

Paper summary: LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

arxiv url: http://arxiv.org/abs/2509.09614v1
Date: Thu, 11 Sep 2025 16:55:04 GMT
Status: Translation complete
System update date: 2025-09-12 16:52:24.483319
Title: LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
Authors: Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang
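Abstract summary: LoCoBench is a benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. The benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages. LoCoBench introduces 8 task categories that capture essential long-context capabilities.
Reference score (independently computed attention score): 85.58151741052616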
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.


Related papers