Fugu-MT Paper Translation (Abstract): GIM: Evaluating models via tasks that integrate multiple cognitive domains

Paper abstract: GIM: Evaluating models via tasks that integrate multiple cognitive domains

arxiv url: http://arxiv.org/abs/2605.18663v1
Date: Mon, 18 May 2026 17:09:50 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:50.121327
Title: GIM: Evaluating models via tasks that integrate multiple cognitive domains
Title (reference translation): GIM: Model evaluation via tasks that integrate multiple cognitive domains
Authors: Rohit Patel, Alexandre Rezende, Steven McClain,
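Abstract summary: The Grounded Integration Measure is a benchmark of 820 original problems. Each problem is an original expert-authored composition. A balanced public-private split provides a built-in contamination diagnostic.
Reference score (independently computed attention metric): 42.01371688303606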
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, the majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous-response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.
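Abstract (reference translation): As LLM benchmarks saturate, the evaluation community has pursued two strategies: escalating knowledge demands (GPQA, HLE), or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second detaches reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) whose difficulty comes from integration. Individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, most with rubric-decomposed scoring (median of 6 independently judged criteria). The balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous-response 2-parameter logistic (2PL) IRT model over more than 200k prompt-response pairs across 28 models. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct the most extensive study of how test-time compute trades off against model capability: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, the calibrated IRT parameters, and all public problems.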
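
The abstract states that 2PL IRT ability estimates order test-configurations correctly even when raw accuracy is distorted by missing data. The Python sketch below illustrates that idea under stated assumptions and is not the authors' implementation. The item parameters, the function names `p_2pl` and `estimate_theta`, and the fractional-target likelihood used for continuous scores are all hypothetical; the abstract does not specify the paper's exact formulation.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: expected score at ability theta.

    a is the item discrimination, b is the item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(scores, items, lo=-6.0, hi=6.0, iters=60):
    """Maximum-likelihood ability from scores in [0, 1].

    scores: list of floats, or None for a missing response (skipped).
    items:  list of (a, b) tuples with a > 0 (hypothetical values here).
    Assumption: continuous scores enter a Bernoulli-style log-likelihood
    as fractional targets, s*log(p) + (1-s)*log(1-p).
    """
    observed = [(s, a, b) for s, (a, b) in zip(scores, items) if s is not None]

    def grad(theta):
        # Derivative of the log-likelihood: sum of a * (s - p(theta)).
        # It decreases monotonically in theta, so bisection finds its root.
        return sum(a * (s - p_2pl(theta, a, b)) for s, a, b in observed)

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Four hypothetical items: two easy (b < 0) and two hard (b > 0).
items = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]

# Two configurations with identical raw accuracy (0.5) on the items they
# answered, but one saw only the easy items and the other only the hard ones.
easy_only = [0.5, 0.5, None, None]
hard_only = [None, None, 0.5, 0.5]

theta_easy = estimate_theta(easy_only, items)
theta_hard = estimate_theta(hard_only, items)
print(theta_easy, theta_hard)
```

Both configurations have the same mean score over their observed items, yet the ability estimate is higher for the one scored on the hard items, because the likelihood accounts for which items were actually answered.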
