Fugu-MT Paper Translation (Summary): Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Paper summary: Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

arxiv url: http://arxiv.org/abs/2605.19723v1
Date: Tue, 19 May 2026 11:56:03 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.310762
Title: Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
Authors: Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima
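Abstract summary: This survey synthesizes recent advances in mathematical reasoning with Large Language Models (LLMs). The review covers approximately 120 peer-reviewed studies and preprints and examines how this research area has evolved.
Reference score (independently computed attention score): 1.8499314936771558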
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

Related papers

Prediction Model of Motivators and Demotivators of Integrating Large Language Models in Software Engineering Education: An Empirical Study [0.9549646359252346]
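Large language models (LLMs) are increasingly influencing software engineering practice and education. This study develops and validates a prediction model of cost-effective strategies for integrating LLMs into software engineering education.
Reference translation of the paper (metadata) (2026-05-10T07:41:25Z)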
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models [56.656180566692946]
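We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models). ThinkARM explicitly abstracts reasoning traces into functional reasoning steps such as analysis, exploration, implementation, and verification. Episode-level representations make the reasoning steps explicit, enabling systematic analysis of how reasoning in modern language models is structured, stabilized, and altered.
Reference translation of the paper (metadata) (2025-12-23T02:44:25Z)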
Interpretability Framework for LLMs in Undergraduate Calculus [0.0]
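Large language models (LLMs) are increasingly used in education, but correctness alone does not capture the quality, reliability, or pedagogical validity of their problem-solving behavior. This paper proposes a novel interpretability framework that uses LLM-generated solutions as a representative domain. The proposed method combines reasoning-flow extraction, which decomposes solutions into semantically labeled operations and concepts, with prompt ablation analysis to assess input salience and output stability.
Reference translation of the paper (metadata) (2025-10-19T17:20:36Z)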
PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning [57.868248683256574]
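PRISM-Physics is a process-level evaluation framework and a benchmark of complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas. The results show that the evaluation framework agrees with human expert scoring.
Reference translation of the paper (metadata) (2025-10-03T17:09:03Z)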
Teaching LLMs to Think Mathematically: A Critical Study of Decision-Making via Optimization [1.246870021158888]
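This paper examines the ability of large language models (LLMs) to formulate and solve decision-making problems using mathematical programming. First, we conduct a systematic review and meta-analysis of recent literature to assess how well LLMs understand, structure, and solve optimization problems across domains. We complement this systematic evidence with targeted experiments designed to evaluate how well state-of-the-art LLMs automatically generate optimization models for problems in computer networks.
Reference translation of the paper (metadata) (2025-08-25T14:52:56Z)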
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
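Large language models (LLMs) in medicine have achieved impressive capabilities, but a significant gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning-enhancement techniques, categorized into training-time strategies and test-time mechanisms.
Reference translation of the paper (metadata) (2025-08-01T14:41:31Z)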
On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study [15.617243755155686]
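Counterfactual reasoning has emerged as an important technique for generalizing the reasoning capabilities of large language models. This paper proposes a decompositional strategy that breaks down counterfactual generation, from causality construction to reasoning over counterfactual interventions.
Reference translation of the paper (metadata) (2025-05-17T04:59:32Z)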
Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
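We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods on meta-evaluation datasets. We also observe that ReasonEval can play a significant role in data selection.
Reference translation of the paper (metadata) (2024-04-08T17:18:04Z)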
A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [51.26815896167173]
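This paper comprehensively analyzes PAMI reviews from three complementary perspectives. Our analysis reveals distinctive organizational patterns and persistent gaps in current review practice. Finally, an evaluation of state-of-the-art AI-generated reviews shows encouraging progress in coherence and organization.
Reference translation of the paper (metadata) (2024-02-20T11:28:50Z)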

The related papers list is generated automatically from the titles and abstracts of papers on this site.

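This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).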