Fugu-MT Paper Translation (Abstract): KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Paper abstract: KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

arxiv url: http://arxiv.org/abs/2606.10403v2
Date: Thu, 11 Jun 2026 09:33:19 GMT
Status: Translation complete
System update date: 2026-06-12 13:39:59.580534
Title: KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
Title (reference translation): KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
Authors: Sanghee Park, Geewook Kim, Kee-Eung Kim
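Abstract summary: Math reasoning benchmarks have proliferated, yet most lack a difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG), a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard or on the items humans found easy.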
Reference score (independently computed attention score): 25.797770168050885
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.
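Abstract (reference translation): Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems, of which a 339-item core set carries official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG), a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard or on the items humans found easy. Together they expose three patterns across a wide range of VLMs (and LLMs via OCR): (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones, two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy, a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.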
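
Illustrative sketch (not from the paper): the abstract does not give the formula for DRG, so the Python below does not implement it. It only illustrates the underlying question, using a plain Pearson correlation between per-item human error rates and a model's 0/1 error indicator as a stand-in alignment statistic. The function names and all data values are hypothetical and chosen purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def accuracy(model_wrong):
    """Fraction of items answered correctly, given 0/1 error indicators."""
    return 1 - sum(model_wrong) / len(model_wrong)


# Hypothetical toy data: cohort error rate per item (higher = harder for humans).
human_error = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]

# Two made-up models with the same accuracy but opposite error placement.
model_a_wrong = [0, 0, 0, 0, 1, 1]  # misses the items humans find hard
model_b_wrong = [1, 1, 0, 0, 0, 0]  # misses the items humans find easy

align_a = pearson(human_error, model_a_wrong)  # positive: errors track human difficulty
align_b = pearson(human_error, model_b_wrong)  # negative: errors fall on easy items

print(accuracy(model_a_wrong), accuracy(model_b_wrong))
print(align_a, align_b)
```

In this toy setup both models have the same accuracy, yet the stand-in statistic has opposite signs, which is the kind of contrast the abstract says aggregate accuracy hides. The metric defined in the paper may differ substantially from this correlation.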
