Fugu-MT Paper Translation (Summary): EEFSUVA: A New Mathematical Olympiad Benchmark

Paper summary: EEFSUVA: A New Mathematical Olympiad Benchmark

arxiv url: http://arxiv.org/abs/2510.01227v1
Date: Tue, 23 Sep 2025 01:57:56 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.726512
Title: EEFSUVA: A New Mathematical Olympiad Benchmark
Title (reference translation): EEFSUVA: A New Mathematical Olympiad Benchmark
Authors: Nicole N. Khatibi, Daniil A. Radamovich, Michael P. Brenner
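Abstract summary: We examine claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. We introduce EEFSUVA, a new benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks.
Reference score (independently computed attention score): 1.7589620883907298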
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematical Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.
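Abstract (reference translation): Recent breakthroughs have prompted claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this study, we examine these claims in detail and assess how well current benchmarks capture genuine LLM mathematical reasoning. These benchmarks draw mainly on the International Mathematical Olympiad (IMO) and related competitions, so they may overstate models' reasoning ability because of potential data contamination and a narrow focus on familiar problem types. To enable a more comprehensive assessment of mathematical understanding, we introduce EEFSUVA, a new benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of difficulty comparable to the IMO and are known for demanding nonstandard problem-solving techniques, yet their problems appear far less often in online corpora. Preliminary results suggest that even state-of-the-art LLMs show a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These results also suggest the importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.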

Related papers