Fugu-MT Paper Translation (Abstract): How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

Paper overview: How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

arxiv url: http://arxiv.org/abs/2604.10508v1
Date: Sun, 12 Apr 2026 07:51:41 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.063879
Title: How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Title (reference translation): How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Authors: Johin Johny Arimbur
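Abstract summary: The paper studies iterative self-repair across seven large language models. On HumanEval and MBPP Sanitized with up to five attempts, self-repair universally improves pass rates. Error-type analysis shows that assertion errors are the hardest to repair at about 45%, while syntax and name errors are repaired at substantially higher rates.
Reference score (independently computed attention score): 0.0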
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models frequently fail to produce correct code on their first attempt, yet most benchmarks evaluate them in a single-shot setting. We investigate iterative self-repair (feeding execution errors back to the model for correction) across seven models spanning three families and both open-weight and proprietary providers: Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (MoE, 16 experts), Llama 4 Maverick (MoE, 128 experts), Qwen3 32B, Gemini 2.5 Flash, and Gemini 2.5 Pro. On HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, self-repair universally improves pass rates: +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Gemini 2.5 Flash achieves the highest final pass rates (96.3% HumanEval, 93.8% MBPP). Most gains concentrate in the first two rounds. Error-type analysis shows assertion errors (logical mistakes) are the hardest to repair at ~45%, while syntax and name errors are repaired at substantially higher rates, connecting to broader findings on the limits of LLM self-correction. Prior work found that weaker models fail at self-repair or require fine-tuning; we show that modern instruction-tuned models succeed with prompting alone, even at 8B scale. We also provide the first comparison of dense and MoE architectures for self-repair, and extend the repair-vs-resampling tradeoff analysis to modern models. A prompt ablation reveals chain-of-thought repair yields up to +5.5 pp additional self-repair gain (measured as improvement in repair delta) over minimal prompting for capable models.
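Abstract (reference translation): Large language models often fail to produce correct code on their first attempt, yet most benchmarks evaluate them in a single-shot setting. We investigate iterative self-repair (feeding execution errors back to the model for correction) across seven models spanning three families and both open-weight and proprietary providers: Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (MoE, 16 experts), Llama 4 Maverick (MoE, 128 experts), Qwen3 32B, Gemini 2.5 Flash, and Gemini 2.5 Pro. On HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, self-repair universally improves pass rates: +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Gemini 2.5 Flash achieves the highest final pass rates (96.3% HumanEval, 93.8% MBPP). Error-type analysis shows that assertion errors (logical mistakes) are the hardest to repair at about 45%, while syntax and name errors are repaired at substantially higher rates, connecting to broader findings on the limits of LLM self-correction. Prior work found that weaker models fail at self-repair or require fine-tuning. We also provide the first comparison of dense and MoE architectures for self-repair and extend the repair-vs-resampling tradeoff analysis to modern models. A prompt ablation shows that chain-of-thought repair yields up to +5.5 pp additional self-repair gain (measured as improvement in repair delta) over minimal prompting for capable models.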
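
The self-repair procedure named in the abstract (run the generated code, feed any execution error back to the model, retry up to five times) can be illustrated with a minimal sketch. This is not the paper's harness: `run_tests`, `self_repair`, and `toy_model` are hypothetical names introduced here, the model is a hard-coded stub standing in for an LLM call, and the in-process `exec` is used only for brevity (a real evaluation would presumably run candidates in an isolated sandbox with a timeout).

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical signature for a model call: (prompt, previous_code, error) -> new code.
Generator = Callable[[str, Optional[str], Optional[str]], str]


def run_tests(code: str, tests: str) -> Optional[str]:
    """Execute candidate code and its tests; return None on success, else the error text.

    Assumption: in-process exec for illustration only, with no sandboxing or timeout.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate solution
        exec(tests, namespace)  # run assertion-style tests against it
    except Exception:
        return traceback.format_exc(limit=1)
    return None


def self_repair(prompt: str, tests: str, generate: Generator,
                max_attempts: int = 5) -> Tuple[bool, int]:
    """Return (solved, attempts_used), feeding each error back to the generator."""
    code: Optional[str] = None
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt, code, error)
        error = run_tests(code, tests)
        if error is None:
            return True, attempt
    return False, max_attempts


def toy_model(prompt: str, previous_code: Optional[str], error: Optional[str]) -> str:
    """Stub standing in for an LLM: wrong on the first try, correct once it sees an error."""
    if error is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


if __name__ == "__main__":
    print(self_repair("Write add(a, b).", "assert add(2, 3) == 5", toy_model))  # (True, 2)
```

With the stub above, the first attempt fails with an assertion error and the second passes, so the call reports success on attempt 2. Counting the problems solved at attempt 1 versus at any attempt up to 5 is one way to obtain a percentage-point gain of the kind the abstract reports.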

Related papers