Fugu-MT Paper Translation (Summary): Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

Paper summary: Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

arxiv url: http://arxiv.org/abs/2604.05100v1
Date: Mon, 06 Apr 2026 18:59:42 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.45083
Title: Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
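Title (reference translation): Edit, Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
Authors: Amir M. Ebrahimi, Gopi Krishnan Rajbahadur
Abstract summary: Instructed code editing accounts for roughly 19% of real-world coding assistant interactions. Out of more than 150 code-related benchmarks, only two, CanItEdit and EDIT-Bench, were found to target instructed code editing.
Reference score (independently computed attention score): 2.5424331328233203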
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Instructed code editing, where an LLM modifies existing code based on a natural language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks directly evaluate this capability. From a survey of over 150 code-related benchmarks, we find that only two, CanItEdit and EDIT-Bench, target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems. Both benchmarks concentrate over 90% of evaluation on Python while TypeScript, GitHub's most-used language, is absent. Backend and frontend development, which together constitute 46% of real-world editing activity, are largely missing, and documentation, testing, and maintenance edits (31.4% of human PRs) have zero representation. Both benchmarks have modest test counts (CanItEdit median 13, EDIT-Bench median 4), though CanItEdit compensates with near-complete whole-file coverage and fail-before/pass-after validation. 59% of EDIT-Bench's low-coverage suites would not detect modifications outside the edit region. EDIT-Bench has 15 problems that are not solved by any of 40 LLMs and 11 of these problems trace failures to poor benchmark artifacts rather than model limitations. Further, 29% of EDIT-Bench problems and 6% of CanItEdit problems share a codebase with at least one other problem within the benchmark. In summary, these benchmarks measure a narrower construct than deployment decisions require. We therefore propose six empirically grounded desiderata and release all audit artifacts so the community can build instructed code-editing benchmarks whose scores reliably reflect real-world editing capability.
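Abstract (reference translation): Instructed code editing, in which an LLM modifies existing code according to a natural-language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks evaluate this capability directly. A survey of more than 150 code-related benchmarks found only two, CanItEdit and EDIT-Bench, that target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems. Both benchmarks concentrate over 90% of evaluation on Python, while TypeScript, the most-used language on GitHub, is absent. Backend and frontend development, which together make up 46% of real-world editing activity, are largely missing, and documentation, testing, and maintenance edits (31.4% of human PRs) are not represented at all. Both benchmarks have modest test counts (CanItEdit median 13, EDIT-Bench median 4), though CanItEdit compensates with near-complete whole-file coverage and fail-before/pass-after validation. 59% of EDIT-Bench's low-coverage suites would not detect modifications outside the edit region. EDIT-Bench has 15 problems that none of 40 LLMs solve, and 11 of these trace to poor benchmark artifacts rather than model limitations. Further, 29% of EDIT-Bench problems and 6% of CanItEdit problems share a codebase with at least one other problem in the same benchmark. In summary, these benchmarks measure a narrower construct than deployment decisions require. We therefore propose six empirically grounded desiderata and release all audit artifacts so the community can build instructed code-editing benchmarks whose scores reliably reflect real-world editing capability.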
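
The abstract mentions fail-before/pass-after validation without spelling it out. As a rough illustration only, the idea can be sketched as follows: a test suite is considered to validate an edit when it fails on the original code and passes on the reference edit. All names here (`run_test`, `fail_before_pass_after`, the toy `before`/`after` functions) are hypothetical and are not taken from CanItEdit, EDIT-Bench, or the paper's audit artifacts.

```python
from typing import Callable

# Hypothetical types for illustration; not from either benchmark's harness.
Impl = Callable[[int], int]
Test = Callable[[Impl], bool]


def run_test(test: Test, impl: Impl) -> bool:
    """Run one test against an implementation; any exception counts as a failure."""
    try:
        return bool(test(impl))
    except Exception:
        return False


def fail_before_pass_after(tests: list[Test], before: Impl, after: Impl) -> bool:
    """True when the suite fails on the original code and passes on the reference edit."""
    fails_before = not all(run_test(t, before) for t in tests)
    passes_after = all(run_test(t, after) for t in tests)
    return fails_before and passes_after


# Toy problem: the instruction is "make the function return the absolute value".
def before(x: int) -> int:
    return x


def after(x: int) -> int:
    return -x if x < 0 else x


strong_suite = [lambda f: f(3) == 3, lambda f: f(-3) == 3]
weak_suite = [lambda f: f(3) == 3]  # never exercises the edited behaviour

print(fail_before_pass_after(strong_suite, before, after))  # True
print(fail_before_pass_after(weak_suite, before, after))    # False
```

The weak suite passes on both versions, so it cannot tell whether the requested edit was made; this is the kind of gap the check is meant to expose.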

Related papers

EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits [72.23150343093447]
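We introduce EDIT-Bench, a benchmark for evaluating code-editing capabilities in real-world settings. EDIT-Bench comprises 545 problems spanning multiple natural and programming languages and a variety of real-world use cases. Model performance varies across categories of user instructions.
Paper reference translation (metadata) (2025-11-06T16:05:28Z)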
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [58.411135609139855]
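Taking "shortcuts" to complete tasks poses significant risks to the reliable evaluation and deployment of large language models. We introduce ImpossibleBench, a benchmark framework that measures the propensity of LLM agents to exploit test cases. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
Paper reference translation (metadata) (2025-10-23T06:58:32Z)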
A Benchmark for Localizing Code and Non-Code Issues in Software Projects [26.511673758202267]
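We introduce MULocBench, a dataset of 1,100 issues from 46 popular GitHub Python projects. Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types. Using this benchmark, we evaluate the performance of state-of-the-art localization methods and five LLM-based prompting strategies.
Paper reference translation (metadata) (2025-09-26T06:05:20Z)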
SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
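We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
Paper reference translation (metadata) (2025-05-29T18:28:02Z)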
UTFix: Change Aware Unit Test Repairing using LLM [24.12850207529614]
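UTFix is a novel approach for repairing unit tests when their focal methods change. The approach uses language models to repair unit tests by providing contextual information such as static code slices, dynamic code slices, and failure messages. To the best of our knowledge, this is the first comprehensive study focused on unit tests in evolving Python projects.
Paper reference translation (metadata) (2025-03-19T06:10:03Z)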
Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications [4.751608548909266]
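FineEdit is a specialized editing model explicitly trained for context-aware text modification. FineEdit outperforms state-of-the-art models on single-turn edits, exceeding Llama-3.2-3B by as much as 30% and surpassing the performance of Mistral-7B-OpenOrca by over 40%.
Paper reference translation (metadata) (2025-02-19T01:41:44Z)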
The Mirage of Model Editing: Revisiting Evaluation in the Wild [70.17413507444704]
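We introduce QAEdit, a new benchmark aligned with widely used question answering (QA) datasets, and WILD, a task-agnostic evaluation framework. Single-edit experiments show that current editing methods perform substantially worse than previously reported.
Paper reference translation (metadata) (2025-02-16T15:57:55Z)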
CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
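We propose CLOVER, a benchmark for evaluating models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks.
Paper reference translation (metadata) (2025-02-12T21:42:56Z)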
Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions [11.327913840111378]
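Defects4J-NL2Fix is a dataset of 283 Java programs from the popular Defects4J dataset, augmented with high-level descriptions of bug fixes. This study empirically evaluates the performance of several state-of-the-art LLMs on this task.
Paper reference translation (metadata) (2023-04-07T18:58:33Z)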

The related paper list is generated automatically from the titles and abstracts of papers on this site.
