Fugu-MT Paper Translation (Abstract): The Evaluation Trap: Benchmark Design as Theoretical Commitment

Paper overview: The Evaluation Trap: Benchmark Design as Theoretical Commitment

arxiv url: http://arxiv.org/abs/2605.14167v1
Date: Wed, 13 May 2026 22:41:29 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.525444
Title: The Evaluation Trap: Benchmark Design as Theoretical Commitment
Authors: Theodore J Kalaitzidis
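Abstract summary: This paper introduces Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims. The procedure is demonstrated through a worked audit of Dupoux et al. (2026), a proposal that revises the dominant paradigm's theoretical assumptions at the architectural level.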
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. When assumptions function as unexamined commitments, benchmarks stabilize the dominant paradigm by narrowing what counts as progress. Over time, narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions. The result is a trap: evaluation frameworks treat self-reinforcing assessments as valid, both creating and obscuring structural limits on what the current paradigm can accomplish. We introduce Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims and auditing whether proposed benchmarks can discriminate the claimed capability from proxy behaviors. The contribution is meta-evaluative: an audit procedure, a failure mode taxonomy, and benchmark-design criteria for evaluating capability-evaluation coherence. We demonstrate the procedure through a worked audit of Dupoux et al. (2026), a proposal that revises the dominant paradigm's theoretical assumptions at the architectural level while reproducing them in its evaluation criteria, thereby entrenching the constraint it seeks to overcome in a form the evaluation cannot detect.
