Fugu-MT Paper Translation (Abstract): The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?

Paper overview: The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?

arxiv url: http://arxiv.org/abs/2603.03334v1
Date: Wed, 11 Feb 2026 10:20:44 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.163064
Title: The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?
Title (reference translation): The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?
Authors: Bianca Raimondi, Francesco Pivi, Davide Evangelista, Maurizio Gabbrielli
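Abstract summary: CompMath-MCQ is a new benchmark dataset for evaluating mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 questions written by professors of graduate-level courses.
Reference score (independently computed attention score): 1.2891189282516038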
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 original questions authored by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Three answer options are provided for each question, exactly one of which is correct. To ensure the absence of data leakage, all questions are newly created and not sourced from existing materials. The validity of questions is verified through a procedure based on cross-LLM disagreement, followed by manual expert review. By adopting a multiple-choice format, our dataset enables objective, reproducible, and bias-free evaluation through the lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: https://github.com/biancaraimondi/CompMath-MCQ.git
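Abstract (reference translation): The evaluation of Large Language Models (LLMs) on mathematical reasoning has focused mainly on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for evaluating LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 original questions written by professors of graduate-level courses, covering topics such as Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Three options are given for each question, exactly one of which is correct. To ensure that there is no data leakage, all questions are newly created and not drawn from existing materials. The validity of the questions is verified by a procedure based on cross-LLM disagreement, followed by manual expert review. Adopting a multiple-choice format enables objective, reproducible, and bias-free evaluation with the lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: https://github.com/biancaraimondi/CompMath-MCQ.git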
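
The abstract mentions a validation step based on cross-LLM disagreement and a three-option format, but gives no implementation details. The following is only a minimal sketch of how such a check might look, not the authors' procedure: the data layout, the model names, the `flag_for_review` and `accuracy` helpers, and the `min_agreement` threshold are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical data: for each question, the option ("A", "B" or "C")
# chosen by each model. Model names are placeholders, not real systems.
predictions = {
    "q1": {"model_1": "A", "model_2": "A", "model_3": "A"},
    "q2": {"model_1": "A", "model_2": "B", "model_3": "C"},
    "q3": {"model_1": "B", "model_2": "B", "model_3": "C"},
}
# Hypothetical answer key (exactly one correct option per question).
answer_key = {"q1": "A", "q2": "B", "q3": "C"}


def flag_for_review(predictions, answer_key, min_agreement=1.0):
    """Return IDs of questions that should go to manual expert review.

    A question is flagged when the models disagree among themselves
    (share of the most common answer is below min_agreement) or when
    their most common answer differs from the answer key.
    """
    flagged = []
    for qid, by_model in predictions.items():
        votes = Counter(by_model.values())
        top_option, top_count = votes.most_common(1)[0]
        agreement = top_count / len(by_model)
        if agreement < min_agreement or top_option != answer_key[qid]:
            flagged.append(qid)
    return flagged


def accuracy(predictions, answer_key, model):
    """Fraction of questions a given model answered correctly."""
    correct = sum(predictions[q][model] == answer_key[q] for q in answer_key)
    return correct / len(answer_key)


if __name__ == "__main__":
    print("Flagged for review:", flag_for_review(predictions, answer_key))
    for model in ("model_1", "model_2", "model_3"):
        print(model, "accuracy:", round(accuracy(predictions, answer_key, model), 3))
    # With three options per question, random guessing scores about 1/3.
    print("Chance baseline:", round(1 / 3, 3))
```

With three options per question, any reported accuracy is best read against the 1/3 chance baseline rather than against zero.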
