Fugu-MT Paper Translation (Abstract): Understanding Robustness of Model Editing in Code LLMs: An Empirical Study

Paper summary: Understanding Robustness of Model Editing in Code LLMs: An Empirical Study

arxiv url: http://arxiv.org/abs/2511.03182v1
Date: Wed, 05 Nov 2025 04:58:13 GMT
Status: Translation complete
Last updated in system: 2025-11-06 18:19:32.330366
Title: Understanding Robustness of Model Editing in Code LLMs: An Empirical Study
Title (reference translation): Robustness of Model Editing in Code LLMs: An Empirical Study
Authors: Vinaik Chhetri, A. B Siddique, Umar Farooq
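Abstract summary: This paper presents a systematic study of five state-of-the-art model editing methods, applied to three leading open-source code LLMs: CodeLlama, CodeQwen1.5, and DeepSeek-Coder. Instant edits consistently degrade model performance: syntactic validity drops by up to 86 percentage points, and functional correctness declines by 45 points even in the best-performing setting.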
Reference score (attention metric computed by this site): 1.5624785508022727
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in software development. However, while LLMs remain static after pretraining, programming languages and APIs continue to evolve, leading to the generation of deprecated or incompatible code that undermines reliability. Retraining LLMs from scratch to reflect such changes is computationally expensive, making model editing a promising lightweight alternative that updates only a small subset of parameters. Despite its potential, it remains unclear whether model editing yields genuine syntactic and semantic adaptations or merely superficial fixes. In this work, we present a systematic study of five state-of-the-art model editing methods: Constrained Fine-Tuning (FT), GRACE, MEMIT, PMET, and ROME. We apply these methods to three leading open-source code LLMs, CodeLlama, CodeQwen1.5, and DeepSeek-Coder, under controlled API deprecation scenarios. Our evaluation covers both instant and sequential editing settings, using three disjoint evaluation sets designed to assess reliability, generalization, and specificity. We measure model correctness at three levels: successful compilation, partial test case pass, and full test pass. Our findings show that instant edits consistently degrade model performance, with syntactic validity dropping by up to 86 percentage points and functional correctness declining by 45 points even in the best-performing setting. Sequential edits further amplify this degradation, and in some cases, model performance collapses entirely. Across all models, most passing generations relied on workarounds rather than correctly adopting the intended changes, while faulty adoptions that result in test failures or compilation errors were significantly more frequent. Correct adoptions, where the model correctly integrates the intended change, occurred in only about 6% of cases.
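Abstract (reference translation): Large language models (LLMs) are increasingly used in software development. However, LLMs remain static after pretraining while programming languages and APIs keep evolving, so the models generate deprecated or incompatible code, which undermines reliability. Retraining LLMs from scratch to reflect such changes is computationally expensive, which makes model editing, updating only a small subset of parameters, a promising lightweight alternative. Despite this potential, it is unclear whether model editing produces genuine syntactic and semantic adaptations or only superficial fixes. This work systematically studies five state-of-the-art model editing methods: Constrained Fine-Tuning (FT), GRACE, MEMIT, PMET, and ROME. The methods are applied to three leading open-source code LLMs (CodeLlama, CodeQwen1.5, and DeepSeek-Coder) under controlled API deprecation scenarios. The evaluation covers both instant and sequential editing settings, using three disjoint evaluation sets designed to assess reliability, generalization, and specificity. Model correctness is measured at three levels: successful compilation, partial test case pass, and full test pass. The results show that instant edits consistently degrade model performance: syntactic validity drops by up to 86 percentage points, and functional correctness declines by 45 points even in the best-performing setting. Sequential edits amplify this degradation further, and in some cases model performance collapses entirely. Across all models, most passing generations relied on workarounds rather than correctly adopting the intended change, while faulty adoptions that led to test failures or compilation errors were significantly more frequent. Correct adoptions, in which the model correctly integrates the intended change, occurred in only about 6% of cases.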
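The three correctness levels named in the abstract (successful compilation, partial test case pass, full test pass) can be illustrated with a small grading routine. The following is a minimal sketch and not the authors' evaluation harness: the names `Level`, `grade`, and `percentage_point_drop` are hypothetical, the generated code is assumed to be Python, "compilation" is approximated with the built-in `compile`, and each test is assumed to be a callable that inspects the executed namespace. The counts in the usage lines are made up for illustration only.

```python
from enum import IntEnum
from typing import Callable, Sequence


class Level(IntEnum):
    """Hypothetical grading scale; levels 1-3 mirror the abstract's three levels."""
    COMPILE_FAIL = 0   # source does not compile at all
    COMPILES = 1       # compiles, but no test case passes
    PARTIAL_PASS = 2   # at least one, but not every, test case passes
    FULL_PASS = 3      # every test case passes


def grade(source: str, tests: Sequence[Callable[[dict], bool]]) -> Level:
    """Grade one generated snippet against a list of test callables."""
    try:
        code = compile(source, "<generated>", "exec")
    except SyntaxError:
        return Level.COMPILE_FAIL

    namespace: dict = {}
    try:
        # NOTE: exec runs the code unsandboxed; acceptable only for a toy sketch.
        exec(code, namespace)
    except Exception:
        return Level.COMPILES  # compiled, but failed while defining its names

    passed = 0
    for test in tests:
        try:
            if test(namespace):
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failed test

    if tests and passed == len(tests):
        return Level.FULL_PASS
    if passed > 0:
        return Level.PARTIAL_PASS
    return Level.COMPILES


def percentage_point_drop(before_ok: int, after_ok: int, total: int) -> float:
    """Absolute drop in a success rate, in percentage points (not a relative change)."""
    return 100.0 * (before_ok - after_ok) / total


if __name__ == "__main__":
    # Hypothetical task: the generation must define add(a, b).
    demo_tests = [
        lambda ns: ns["add"](1, 2) == 3,
        lambda ns: ns["add"](0, 0) == 0,
    ]
    print(grade("def add(a, b): return a + b", demo_tests).name)
    print(grade("def add(a, b): return 0", demo_tests).name)
    print(grade("def add(a, b) return", demo_tests).name)
    # Made-up counts: 45 of 50 snippets compiled before an edit, 2 of 50 after.
    print(percentage_point_drop(45, 2, 50))
```

Reporting the drop in percentage points, as the abstract does, keeps the figure an absolute difference between two rates, which is why "by up to 86 percentage points" should not be read as "down to 86".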
