Fugu-MT Paper Translation (Abstract): Foundation Models as Oracles for Refactoring Correctness Detection

Paper overview: Foundation Models as Oracles for Refactoring Correctness Detection

arxiv url: http://arxiv.org/abs/2605.02096v1
Date: Sun, 03 May 2026 23:31:18 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:50.078146
Title: Foundation Models as Oracles for Refactoring Correctness Detection
Authors: Rohit Gheyi, Rian Melo, Jonhnanthan Oliveira, Marcio Ribeiro, Baldoino Fonseca
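Abstract summary: This study examines the potential of foundation models to serve as oracles for detecting refactoring bugs in Java programs. The results indicate that foundation models can be effective for this task, although performance varies across models.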
Reference score (independently computed attention score): 0.6596954257395425
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Refactoring tools in popular Integrated Development Environments (IDEs) can introduce unintended behavioral changes or compilation errors, a persistent challenge that undermines developer trust in automated transformations. Traditional detection approaches rely on handcrafted preconditions, and static and dynamic analyses, yet remain limited in adaptability and can miss subtle correctness issues. This study examines the potential of foundation models to serve as oracles for detecting refactoring bugs in Java programs. We evaluate zero-shot prompting, without task-specific training, across 226 real refactoring bugs collected over more than a decade from widely used Java IDEs (IntelliJ-IDEA, Eclipse, and NetBeans), spanning 47 refactoring types. Our results indicate that foundation models can be effective for this task, although performance varies across models. In the first-run setting, GPT-OSS-20B achieved 80.5% accuracy, while GPT-5.4 reached 93.8%. We also evaluated other open and proprietary models: Gemma-4-31B achieved the strongest result among open models, and Gemini-3.1-Pro-Preview achieved the best overall result among all evaluated models. Metamorphic testing further shows that model predictions are largely consistent under intended semantics-preserving code variations, suggesting that superficial pattern matching may not fully account for the observed behavior. Beyond detection accuracy, foundation models can provide short explanations that may help support developer inspection, operate across refactoring types without explicitly encoded refactoring-specific rules, and may serve as lightweight triage aids in development workflows. Our findings suggest that foundation models can complement traditional refactoring checks by flagging suspicious transformations for developer inspection.
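The evaluation described in the abstract combines two measurements: how often the oracle's verdict matches the known label, and how often that verdict stays the same under a semantics-preserving code variation (the metamorphic check). The following is a minimal sketch of those two measurements, not the authors' implementation. The names `Case`, `accuracy`, `consistency`, `rename`, and `toy_oracle` are all hypothetical, the two Java snippets are invented for illustration, and `toy_oracle` is a trivial stand-in for what would in practice be a zero-shot model call.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    before: str     # program before the refactoring
    after: str      # program produced by the refactoring tool
    is_buggy: bool  # ground-truth label for the transformation


# Hypothetical oracle signature: returns True when it flags the change as buggy.
Oracle = Callable[[str, str], bool]


def rename(code: str, old: str, new: str) -> str:
    """Whole-word rename, used as a toy semantics-preserving variation."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def accuracy(oracle: Oracle, cases: List[Case]) -> float:
    """Fraction of cases where the oracle's verdict matches the label."""
    hits = sum(oracle(c.before, c.after) == c.is_buggy for c in cases)
    return hits / len(cases)


def consistency(oracle: Oracle, cases: List[Case],
                transform: Callable[[str], str]) -> float:
    """Fraction of cases whose verdict is unchanged after applying
    the same variation to both the before and after programs."""
    same = sum(
        oracle(c.before, c.after)
        == oracle(transform(c.before), transform(c.after))
        for c in cases
    )
    return same / len(cases)


def toy_oracle(before: str, after: str) -> bool:
    # Stand-in for a model call (assumption): flags the change when the
    # number of 'this.' qualifiers differs between the two versions.
    return before.count("this.") != after.count("this.")


def variation(code: str) -> str:
    # Renaming the class should not change program behavior.
    return rename(code, "A", "Account")


cases = [
    # A local variable now shadows the field, so get() returns a different value.
    Case("class A { int x; int get() { return this.x; } }",
         "class A { int x; int get() { int x = 0; return x; } }", True),
    # A plain method rename with no callers to update.
    Case("class A { int x; int get() { return this.x; } }",
         "class A { int x; int read() { return this.x; } }", False),
]

print("accuracy:", accuracy(toy_oracle, cases))
print("consistency:", consistency(toy_oracle, cases, variation))
```

Swapping `toy_oracle` for a function that sends both program versions to a model and parses its answer would leave the two metrics unchanged. A consistency value well below 1.0 would suggest the oracle is reacting to surface features such as identifier names rather than to behavior.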

