Fugu-MT Paper Translation (Abstract): ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Paper summary: ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

arxiv url: http://arxiv.org/abs/2512.07795v1
Date: Mon, 08 Dec 2025 18:26:58 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.999011
Title: ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Title (reference translation): ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Authors: Nearchos Potamitis, Lars Klein, Akhil Arora
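Abstract summary: ReasonBENCH is the first benchmark designed to quantify the underlying instability in large language model (LLM) reasoning. Across tasks from different domains, the vast majority of reasoning strategies and models exhibit high instability. The authors further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability.
Reference score (attention score computed independently by the site): 2.1461777157838724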
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench .
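Abstract (reference translation): Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. However, current evaluation practice overwhelmingly reports single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot, because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, the vast majority of reasoning strategies and models exhibit high instability. In particular, even strategies with similar average performance can show confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, as a result, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench .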
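The abstract's point about single-run accuracy versus multi-run reporting can be illustrated with a minimal sketch. It is not taken from the ReasonBENCH library: the function name `mean_ci`, the per-run solve rates, and the choice of a normal-approximation 95% interval are all assumptions made for illustration, and the abstract does not say which interval estimator the paper uses.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    Hypothetical helper, not part of ReasonBENCH. z=1.96 corresponds to a
    95% interval under a normal approximation; for very few runs a
    t-distribution quantile would be the more careful choice.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.fmean(values)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, z * sem


# Made-up solve rates from five independent runs of two hypothetical methods.
method_a = [0.70, 0.72, 0.71, 0.69, 0.73]
method_b = [0.60, 0.82, 0.71, 0.55, 0.87]

mean_a, half_a = mean_ci(method_a)
mean_b, half_b = mean_ci(method_b)

# Both methods share the same average, but their intervals differ in width.
print(f"A: {mean_a:.3f} +/- {half_a:.3f}")
print(f"B: {mean_b:.3f} +/- {half_b:.3f}")
```

In this made-up data both methods have the same mean solve rate, so a single-number report would not distinguish them, while the interval widths show that the second method is far less stable from run to run.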