Fugu-MT Paper Translation (Abstract): Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

Paper overview: Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2605.09292v1
Date: Sun, 10 May 2026 03:38:37 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.171867
Title: Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
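Title (reference translation): Evaluating Strategy Diversity in LLM Mathematical Reasoning
Authors: Xia Yang, Xuanyi Zhang, Hao Hu, Feng Ji
Abstract summary: A strategy-level evaluation framework based on AMC 10/12 and AIME problems. There is a pronounced decoupling between answer accuracy and strategy diversity. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively.
Reference score (independently computed attention measure): 11.576914513156316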
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models collectively produce 50 benchmark-novel valid strategies, indicating both incomplete coverage of human strategies and some capacity for alternative reasoning. A repeated-run robustness check on 20 problems shows diminishing gains in discovered strategies, with the strongest model recovering only 39 of 55 AoPS-reference strategies (71%) after three runs. These findings position strategy diversity as a complementary dimension for evaluating mathematical reasoning beyond answer correctness.
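The coverage figures in the abstract can be read as a recall over reference strategy families. The following is a minimal sketch of that arithmetic, not the authors' pipeline: it assumes each solution has already been labeled with a strategy identifier (the paper does this with dual-AI coding and human adjudication), and the function name `strategy_recall` and the example labels are hypothetical. Only the 39-of-55 figure comes from the abstract.

```python
def strategy_recall(recovered, reference):
    """Fraction of reference strategy families that appear in the recovered set."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set must be non-empty")
    return len(reference & set(recovered)) / len(reference)


# Hypothetical strategy labels for a single problem (illustration only).
reference = {"coordinates", "similar_triangles", "trigonometry"}
recovered = {"coordinates", "trigonometry", "complex_numbers"}

# Share of the reference strategies that the model reproduced.
recall = strategy_recall(recovered, reference)

# Valid strategies outside the reference set ("benchmark-novel" in the abstract's terms).
novel = recovered - reference

# Figure reported in the abstract: 39 of 55 AoPS-reference strategies after three runs.
robustness_recall = 39 / 55
print(f"{robustness_recall:.0%}")  # prints 71%
```

Keeping recall and novel strategies as separate quantities mirrors the abstract's distinction between incomplete coverage of human strategies and the capacity for alternative reasoning.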
