Fugu-MT Paper Translation (Abstract): EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Paper overview: EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

arxiv url: http://arxiv.org/abs/2511.04486v1
Date: Thu, 06 Nov 2025 16:05:28 GMT
Status: translation complete
Last updated in system: 2025-11-07 20:17:53.492168
Title: EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Title (reference translation): EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Authors: Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue
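Abstract summary: This paper introduces EDIT-Bench, a benchmark for evaluating code editing capabilities in real-world settings. EDIT-Bench comprises 545 problems spanning multiple natural and programming languages and a variety of real-world use cases. Model performance varies across categories of user instructions.
Reference score (independently computed attention score): 72.23150343093447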
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EDIT-Bench comprises 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EDIT-Bench introduces context-dependent problems that require the model to understand code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EDIT-Bench is a challenging set of problems where only 5 models score over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect task success rate, with performance varying by up to 11%, indicating the importance of evaluating with realistic context.
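Abstract (reference translation): Instructed code editing, in which an LLM directly modifies a developer's existing code based on a user's instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks evaluate this capability directly, and current datasets often rely on artificial sources. We introduce EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, that is, user instructions and code contexts collected in the wild. EDIT-Bench consists of 545 problems, multiple natural and programming languages, and a variety of real-world use cases ranging from resolving errors to adding features. EDIT-Bench introduces context-dependent problems that require understanding the code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and show that EDIT-Bench is a challenging problem set on which only 5 models score above 60%. Model performance varies across categories of user instructions. Furthermore, varying levels of contextual information greatly affect task success rate, with performance varying by up to 11%, which indicates the importance of evaluating with realistic context.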

Related papers