Fugu-MT Paper Translation (Abstract): How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

Paper overview: How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

arxiv url: http://arxiv.org/abs/2604.04791v1
Date: Mon, 06 Apr 2026 15:58:47 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.267156
Title: How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling
Title (reference translation): How Far Are We? A Systematic Evaluation of LLMs and Human Experts in the Mathematical Contest in Modeling
Authors: Yuhang Liu, Heyan Huang, Yizhe Yang, Hongyan Zhao, Zhizhuo Zeng, Yang Gao
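Abstract summary: Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems that require end-to-end workflows remains unclear. This paper proposes a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria.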
Reference score (independently computed attention measure): 43.64673843846063
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.
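Abstract (reference translation): Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems that require end-to-end workflows remains unclear. Mathematical modeling competitions provide a rigorous testbed for evaluating this end-to-end problem-solving capability. This paper proposes a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. The framework's reliability is validated by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, showing substantially stronger alignment than existing evaluation schemes. Using this framework, the study reveals a comprehension-execution gap in state-of-the-art LLMs: they perform well in early stages such as problem identification and formulation, but show persistent deficiencies in execution-oriented stages, including model solving, code implementation, and result analysis. These gaps persist even as model scale increases. The failures are further traced to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Bridging this gap requires approaches beyond model scaling, and the findings offer insights for applying LLMs to complex real-world problem solving.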

Related papers