Fugu-MT Paper Translation (Abstract): Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting

Paper abstract: Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting

arxiv url: http://arxiv.org/abs/2511.18649v2
Date: Sun, 30 Nov 2025 15:52:16 GMT
Status: Translation complete
Last updated in system: 2025-12-02 13:32:07.417869
Title: Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting
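Title (reference translation): Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting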
Authors: Goun Pyeon, Inbum Heo, Jeesu Jung, Taewook Hwang, Hyuk Namgoong, Hyein Seo, Yerim Han, Eunbin Kim, Hyeonseok Kang, Sangkeun Jung
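Abstract summary: This study systematically evaluated the mathematical reasoning capabilities of large language models (LLMs) using the Mathematics section of the 2026 Korean College Scholastic Ability Test (CSAT). To address data leakage in existing benchmarks, all 46 questions (22 common and 24 elective) were digitized within two hours of the exam's public release.
Reference score (independently computed attention score): 5.313647446600863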
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (Text-only, Image-only, Text+Figure) and prompt languages (Korean, English). The GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations, while Grok 4, Qwen 3 235B, and Gemini 2.5 pro also scored above 97 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed Calculus as the weakest domain with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (82.6->100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a standardized digitization pipeline that converts human-targeted exam materials into LLM-ready evaluation data, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard; https://isoft.cnu.ac.kr/csat2026/
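Abstract (reference translation): This study systematically evaluates the mathematical reasoning capabilities of large language models (LLMs) using the Mathematics section of the 2026 Korean College Scholastic Ability Test (CSAT), ensuring a completely contamination-free evaluation environment. To address data leakage in existing benchmarks, all 46 questions (22 common and 24 elective) were digitized within two hours of the exam's public release, eliminating any possibility of their inclusion in model training data. A comprehensive evaluation of 24 LLMs was conducted across input modalities (Text-only, Image-only, Text+Figure) and prompt languages (Korean, English). The GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations, while Grok 4, Qwen 3 235B, and Gemini 2.5 Pro also scored above 97 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, indicating high cost-effectiveness. Problem-specific analysis identified Calculus as the weakest domain, with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while the effect of prompt language varied with model scale. In reasoning enhancement experiments with the GPT-5 series, increasing reasoning intensity improved performance (82.6->100 points) but quadrupled token usage and sharply reduced efficiency, suggesting that models with minimal reasoning may be more practical. The study contributes (1) an implementation of a completely unexposed evaluation environment, (2) a standardized digitization pipeline that converts exam materials written for humans into LLM-ready evaluation data, and (3) a practical evaluation perspective that integrates performance, cost, and time. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard: https://isoft.cnu.ac.kr/csat2026/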


Related papers