Fugu-MT paper translation (abstract): DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

Paper overview: DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

arxiv url: http://arxiv.org/abs/2505.08744v1
Date: Tue, 13 May 2025 16:58:05 GMT
Status: translation complete
Last updated in system: 2025-05-14 20:57:54.683057
Title: DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models
Title (reference translation): DeepMath-Creative: A Benchmark for Evaluating the Mathematical Creativity of Large Language Models
Authors: Xiaoyang Chen, Xinan Dai, Yu Du, Qian Feng, Naixu Guo, Tingshuo Gu, Yuting Gao, Yingyi Gao, Xudong Han, Xiang Jiang, Yilin Jin, Hongyi Lin, Shisheng Lin, Xiangnan Li, Yuante Li, Yixing Li, Zhentao Lai, Zilu Ma, Yingrong Peng, Jiacheng Qian, Hao-Yu Sun, Jianbo Sun, Zirui Wang, Siwei Wu, Zian Wang, Bin Xu, Jianghao Xu, Yiyang Yu, Zichuan Yang, Hongji Zha, Ruichong Zhang,
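Abstract summary: The DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM. This paper represents the initial contribution of this initiative.
Reference score (attention level computed independently by this site): 22.050241159312307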
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria -- emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations -- the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.
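Abstract (reference translation): To improve the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative that aims to develop an open mathematical LLM and to evaluate its mathematical creativity systematically. This paper is the first contribution of that initiative. Recent work on mathematical LLMs has mainly emphasized reasoning skills, as shown by benchmarks on elementary to undergraduate-level tasks, while the creative capabilities of these models have received comparatively little attention and evaluation datasets remain scarce. To address this gap, the authors propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a new high-quality benchmark of constructive problems spanning algebra, geometry, analysis, and other domains. They use this dataset to evaluate the creative problem-solving abilities of mainstream LLMs systematically. The experiments show that even under lenient scoring criteria (which emphasize the core components of a solution and disregard minor inaccuracies such as small logical gaps, incomplete justifications, or redundant explanations), the best-performing model, O3 Mini, reaches only 70% accuracy, mainly on basic undergraduate-level constructive tasks. Performance drops sharply on more complex problems, and the models fail to offer substantive strategies for open problems. These results suggest that, although current LLMs show some constructive proficiency on familiar, lower-difficulty problems, that performance is likely due to the recombination of memorized patterns rather than genuine creative insight or novel synthesis.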

Related papers

ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges [72.19809898215857]
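ModelingBench is a new benchmark featuring real-world-inspired, open-ended problems drawn from mathematical modeling competitions across diverse domains. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. The paper also introduces ModelingAgent, a multi-agent framework that coordinates tool use.
Paper reference translation (metadata) (2025-05-21T03:33:23Z)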
RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics [21.453837660747844]
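Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely mainly on competition problems, formal proofs, or artificially constructed problems. The paper introduces RealMath, a new benchmark derived directly from research papers and mathematical forums, to assess LLMs' abilities on real mathematical tasks.
Paper reference translation (metadata) (2025-05-18T23:32:46Z)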
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class [27.93059568425132]
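HARDMath2 is a dataset of 211 original problems covering the core topics of a graduate applied mathematics class. The dataset was designed and verified by the students and instructors of a core graduate mathematics course at Harvard University. It was built through a novel collaborative environment that encourages students to write and refine difficult problems consistent with the class syllabus.
Paper reference translation (metadata) (2025-05-17T00:52:49Z)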
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
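OlymMATH is a new Olympiad-level mathematics benchmark designed to rigorously test the complex reasoning capabilities of LLMs. It contains 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
Paper reference translation (metadata) (2025-03-27T11:20:17Z)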
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM [58.42678619252968]
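Creation-MMBench is a benchmark designed to evaluate the creativity of multimodal large language models (MLLMs). It consists of 765 test cases spanning 51 fine-grained tasks. Experimental results show that open-source MLLMs significantly underperform proprietary models on creative tasks.
Paper reference translation (metadata) (2025-03-18T17:51:34Z)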
Large Language Models for Mathematical Analysis [3.7325315394927023]
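This work addresses a critical gap in mathematical reasoning and contributes to the advancement of trustworthy AI. The authors develop the DEMI-MathAnalysis dataset and also design a guiding framework to enhance the problem-solving abilities of LLMs.
Paper reference translation (metadata) (2024-12-28T20:37:55Z)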
LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages [14.04286044600141]
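Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing tasks. However, their proficiency in mathematical reasoning remains a key challenge. The paper proposes a process-oriented framework for evaluating the ability of LLMs to construct mathematical models.
Paper reference translation (metadata) (2024-05-21T18:29:54Z)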
Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
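The paper introduces ReasonEval, a new methodology for evaluating the quality of reasoning steps. It shows that ReasonEval consistently outperforms baseline methods on meta-evaluation datasets. The authors also observe that ReasonEval can play a significant role in data selection.
Paper reference translation (metadata) (2024-04-08T17:18:04Z)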
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
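The paper focuses on two popular reasoning tasks: arithmetic reasoning and code generation. It introduces (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method for applying these perturbations, and (iii) two datasets. All models show a significant performance drop on the perturbed questions.
Paper reference translation (metadata) (2024-01-17T18:13:07Z)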
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
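The paper introduces SciBench, an expansive benchmark suite for large language models (LLMs). SciBench contains datasets featuring a range of college-level scientific problems from mathematics, chemistry, and physics. The results show that current LLMs fall short of satisfactory performance, with an overall score of only 43.22%.
Paper reference translation (metadata) (2023-07-20T07:01:57Z)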

The related-papers list is generated automatically from the titles and abstracts of papers on this site.

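This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use (including all information and translations).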