Fugu-MT 論文翻訳(概要): CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective

論文の概要: CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective

arxiv url: http://arxiv.org/abs/2506.04481v1
Date: Wed, 04 Jun 2025 22:00:52 GMT
ステータス: 翻訳完了
システム内更新日: 2025-06-06 21:53:49.444162
Title: CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective
Title（参考訳）: CogMath:人間の認知的視点からLLMの正解的数学的能力を評価する
Authors: Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen,
Abstract要約: CogMathは、人間の推論プロセスを3段階に定式化している。各次元において,この次元からLLMの熟達度を評価する問合せを生成するために,emphInquiry-emphJudge-emphReference'のマルチエージェントシステムの開発を行う。 LLMは、9次元からのすべての問い合わせに優れている場合にのみ、真に問題をマスターすると考えられている。
参考スコア（独自算出の注目度）: 68.94793547575343
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which are insufficient for assessing their authentic capabilities. In this paper, we propose \textbf{CogMath}, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes human reasoning process into 3 stages: \emph{problem comprehension}, \emph{problem solving}, and \emph{solution summarization}. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}'' multi-agent system to generate inquiries that assess LLMs' mastery from this dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions. By applying CogMath on three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30\%-40\%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
Abstract（参考訳）: 大規模言語モデル(LLM)は複雑な数学的タスクを解く上で有望であるが、既存の評価パラダイムは、その正当性を評価するのに不十分な、全体の回答精度の粗い尺度にのみ依存している。本稿では,人間の認知のレンズを通してLLMの数学的能力を包括的に評価する「textbf{CogMath}」を提案する。具体的には、心理学理論にインスパイアされたCogMathは、人間の推論プロセスを3段階に分類する: \emph{problem comprehension}、 \emph{problem solve}、 \emph{solution summarization}。これらの段階では,数値計算,知識,反事実といった視点を考察し,9つのきめ細かい評価次元を設計する。各次元において、この次元から LLM の熟達度を評価する問合せを生成するために ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}' のマルチエージェントシステムを開発する。 LLMは、9次元からのすべての問い合わせに優れている場合にのみ、真に問題をマスターすると考えられている。 3つのベンチマークにCogMathを適用することで、7つの主要なLCMの数学的能力が30-40-%過大評価されていることが明らかになった。さらに、特定の段階/次元にまたがる強みや弱みを見つけ出し、推論能力をさらに強化するための深い洞察を提供する。

関連論文リスト

MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHayは、LLMの長文数学的推論能力を評価するために設計された自動ベンチマークである。我々は,8つのトップパフォーマンスモデルの長文数学的推論能力を評価するために,MathHayの広範な実験を行った。
論文参考訳（メタデータ） (2024-10-07T02:30:07Z)
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBenchは、大規模言語モデルの数学的能力を厳格に評価する新しいベンチマークである。 MathBenchは幅広い数学の分野にまたがっており、理論的な理解と実践的な問題解決のスキルの両方を詳細に評価している。
論文参考訳（メタデータ） (2024-05-20T17:52:29Z)
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange [25.419977967846144]
大規模言語モデル(LLM)は、様々な自然言語タスクにおいて例外的な機能を示した。本稿では、複雑な数学的問題解決をナビゲートする上でのLLMの限界について考察する。
論文参考訳（メタデータ） (2024-03-30T12:48:31Z)
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
MLLMの公平かつ詳細な評価のために設計された全周視覚数学ベンチマークであるMathVerseを紹介する。我々は,2,612の高品位・多目的数学問題を,公開情報源の図を用いて慎重に収集する。このアプローチにより、MathVerseは、数学的推論のためのビジュアルダイアグラムを、どの程度のMLLMが真に理解できるかを包括的に評価することができる。
論文参考訳（メタデータ） (2024-03-21T17:59:50Z)
FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models [44.63505885248145]
FineMathは、中国語大言語モデル(LLM)を評価するための詳細な数学的評価ベンチマークデータセットである。 FineMathは、小学校数学で教えられる主要な数学的概念をカバーし、数学用語の問題の17のカテゴリに分けられる。数学の単語問題のうち17のカテゴリは、これらの問題を解決するために必要な推論ステップの数に応じて、難易度を手動でアノテートする。
論文参考訳（メタデータ） (2024-03-12T15:32:39Z)
MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
大規模言語モデル(LLM)は問題解決において顕著な能力を示した。しかし、数学的な問題を解く能力は依然として不十分である。高品質な数学的推論データを作成するためのシンプルでスケーラブルな方法であるMathScaleを提案する。
論文参考訳（メタデータ） (2024-03-05T11:42:59Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
大規模言語モデル (LLM) は、様々な数学的推論ベンチマークで顕著な性能を達成している。 1つの必須かつ頻繁な証拠は、数学の質問がわずかに変更されたとき、LLMは誤って振る舞うことができることである。このことは, LLMの数学推論能力の頑健性を評価するために, 幅広い質問のバリエーションを試すことによるものである。
論文参考訳（メタデータ） (2024-02-29T15:26:14Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。