Fugu-MT Paper Translation (Abstract): Analyzing the Instability of Large Language Models in Automated Bug Injection and Correction

Paper overview: Analyzing the Instability of Large Language Models in Automated Bug Injection and Correction

arxiv url: http://arxiv.org/abs/2509.06429v1
Date: Mon, 08 Sep 2025 08:23:49 GMT
Status: Translation complete
Last updated in system: 2025-09-09 14:07:04.017069
Title: Analyzing the Instability of Large Language Models in Automated Bug Injection and Correction
Title (reference translation): Analysis of the Instability of Large Language Models in Automated Bug Injection and Correction
Authors: Mehmet Bilal Er, Nagehan İlhan, Umut Kuran
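Abstract summary: Large Language Models (LLMs) are used in software engineering tasks. When run at different times with the same input, they can generate radically different code. This study examines how unstable LLMs are when fixing code bugs.
Reference score (independently computed attention score): 0.0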
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The use of Large Language Models (LLMs) in software engineering tasks is growing, especially in the areas of bug fixing and code generation. Nevertheless, these models often yield unstable results; when executed at different times with the same input, they can generate radically different code. The consistency of LLMs in bug-fixing tasks has not yet been thoroughly assessed, despite the fact that this instability has typically been discussed in the literature in relation to code generation. The purpose of this study is to look into how unstable an LLM like ChatGPT is when it comes to fixing code bugs. We examine the structural, syntactic, and functional variations among several fix recommendations made in response to the same prompt using code samples with various error types. Additionally, we assess how instability is affected by the temperature settings (0, 0.5, and 1) used for the model's deterministic operation. For a total of 20 problems in the experimental analysis, the model produced three fix suggestions at each temperature value, comparing nine distinct outputs for each problem. The Syntax Similarity and Output Equivalence Rate (OER) metrics were used to assess the outputs' structural and functional consistency. The results demonstrate that the model's outputs become much more unstable and variable as the temperature rises, with high temperatures showing especially high rates of functional failure. According to syntax similarity analyses, the suggested fixes show notable structural differences at high temperatures but are fairly similar at low temperatures. The purpose of this study is to provide important methodological insights into how LLM-based error correction systems can be applied more consistently in software development processes while also casting doubt on their dependability.
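Abstract (reference translation): The use of Large Language Models (LLMs) in software engineering tasks is increasing, particularly in bug fixing and code generation. However, these models often produce unstable results: when run at different times with the same input, they can generate radically different code. This instability has typically been discussed in the literature on code generation, but the consistency of LLMs in bug-fixing tasks has not yet been thoroughly evaluated. The aim of this study is to investigate how unstable an LLM such as ChatGPT is when fixing code bugs. Using code samples with various error types, we examine the structural, syntactic, and functional variation among several fix suggestions produced in response to the same prompt. We also evaluate how the temperature settings (0, 0.5, and 1) used for the model's deterministic operation affect instability. For a total of 20 problems in the experimental analysis, the model produced three fix suggestions at each temperature value, so nine distinct outputs were compared for each problem. The Syntax Similarity and Output Equivalence Rate (OER) metrics were used to evaluate the structural and functional consistency of the outputs. The results show that the model's outputs become more unstable and variable as the temperature rises, with especially high rates of functional failure at high temperatures. According to the syntax similarity analysis, the suggested fixes show notable structural differences at high temperatures but are relatively similar at low temperatures. The aim of this study is to provide important methodological insights into how LLM-based error correction systems can be applied more consistently in software development processes, while also calling their reliability into question.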
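
The abstract names two metrics, Syntax Similarity and Output Equivalence Rate (OER), but does not define them. The following is a minimal Python sketch of how such a comparison could look, under stated assumptions: syntax similarity is approximated with the standard library's `difflib.SequenceMatcher`, and OER is assumed to be the fraction of fix pairs whose outputs agree on every test input. The helper names, the `add` function, and the sample fixes are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def syntax_similarity(a, b):
    """Character-level similarity in [0, 1] between two code strings.

    Assumption: difflib is a stand-in; the paper's metric may differ.
    """
    return SequenceMatcher(None, a, b).ratio()


def mean_pairwise_similarity(fixes):
    """Average syntax similarity over all unordered pairs of fixes."""
    pairs = list(combinations(fixes, 2))
    if not pairs:
        return 1.0
    return sum(syntax_similarity(a, b) for a, b in pairs) / len(pairs)


def run_fix(source, func_name, test_inputs):
    """Run one candidate fix on each test input and collect its outputs.

    Returns None if the code cannot be loaded. Exceptions raised by a
    call are recorded as markers. Model-generated code should be run in
    a sandbox; plain exec() is used here only to keep the sketch short.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return None
    results = []
    for args in test_inputs:
        try:
            results.append(repr(func(*args)))
        except Exception as exc:
            results.append("error:" + type(exc).__name__)
    return tuple(results)


def output_equivalence_rate(fixes, func_name, test_inputs):
    """Fraction of fix pairs with identical outputs on all test inputs.

    Assumption: this pairwise definition is illustrative only.
    """
    outputs = [run_fix(f, func_name, test_inputs) for f in fixes]
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    equal = sum(1 for a, b in pairs if a is not None and a == b)
    return equal / len(pairs)


# Hypothetical sample: three fix suggestions for the same buggy function.
fixes = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    result = a + b\n    return result\n",
    "def add(a, b):\n    return a - b\n",
]
tests = [(1, 2), (0, 0), (5, -5)]

print("mean syntax similarity:", mean_pairwise_similarity(fixes))
print("output equivalence rate:", output_equivalence_rate(fixes, "add", tests))
```

In the setup the abstract describes, the same comparison would be applied to the three fixes collected at each temperature value, so the two scores can be compared across temperatures 0, 0.5, and 1.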


Related papers