Fugu-MT Paper Translation (Abstract): Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

Paper summary: Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

arxiv url: http://arxiv.org/abs/2510.16062v2
Date: Wed, 22 Oct 2025 09:04:12 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:11.725338
Title: Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
Title (reference translation): Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
Authors: Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, Qingsong Wen, Lixing Chen, Pan Zhou, Lichao Sun
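Abstract summary: Self-correction of large language models (LLMs) is emerging as a key factor in improving reasoning performance. This study introduces CorrectBench, a benchmark for evaluating the effectiveness of self-correction strategies. The results show that 1) self-correction methods can improve accuracy on complex reasoning tasks, and 2) mixing different self-correction strategies yields further improvement, though at reduced efficiency.
Reference score (independently calculated attention score): 57.10533368622962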
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM's reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: https://correctbench.github.io/
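Abstract (reference translation): Self-correction of large language models (LLMs) is emerging as a key factor in improving reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of them remains largely unexplored, and whether LLMs can truly correct themselves is a question of significant interest and concern. This study introduces CorrectBench, a benchmark for evaluating the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings show that: 1) self-correction methods can improve accuracy, especially on complex reasoning tasks; 2) mixing different self-correction strategies yields further improvement, though it reduces efficiency; 3) reasoning LLMs (e.g., DeepSeek-R1) gain only limited improvement from additional self-correction methods and incur high time costs. Interestingly, a comparatively simple CoT baseline shows competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM reasoning performance while highlighting the ongoing challenge of improving its efficiency. We therefore advocate further research focused on optimizing the balance between reasoning capability and operational efficiency. Project Page: https://correctbench.github.io/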

Related papers