Fugu-MT Paper Translation (Abstract): Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Paper overview: Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

arxiv url: http://arxiv.org/abs/2511.04655v1
Date: Thu, 06 Nov 2025 18:43:21 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.569354
Title: Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Title (reference translation): Benchmark designers should "train on the test set" to expose exploitable non-visual shortcuts
Authors: Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie
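Abstract summary: We find that models can ace many benchmarks without strong visual understanding. This is especially problematic for vision-centric benchmarks, which are meant to require visual input. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be.
Reference score (independently computed attention score): 49.99400612296149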
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to ``game'' their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly ``training on the test set'' -- probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a ``Test-set Stress-Test'' (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via $k$-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an ``Iterative Bias Pruning'' (IBP) procedure. Applying this framework to four benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
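Abstract (reference translation): Robust benchmarks are essential for evaluating Multimodal Large Language Models (MLLMs). However, we find that models can score well on many multimodal benchmarks without strong visual understanding, exploiting biases, linguistic priors, and superficial patterns instead. This is especially problematic for vision-centric benchmarks, which are meant to require visual input. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore first try to "game" their own benchmarks, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set", that is, probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose a benchmark's susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool fine-tunes a powerful Large Language Model via $k$-fold cross-validation on only the non-visual, textual inputs of the test set, revealing shortcut performance and assigning each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic that operates on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering out high-bias samples with an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), we uncover pervasive non-visual biases. As a case study, we apply the full framework to create VSI-Bench-Debiased, which shows reduced non-visual solvability and a wider vision-blind performance gap than the original.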
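
The two components described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation. As a loudly stated assumption, a simple answer-frequency predictor stands in for both the fine-tuned Large Language Model and the Random Forest, and the only "hand-crafted feature" is a question-type string. The function names `kfold_bias_scores` and `iterative_bias_pruning`, the `batch` and `threshold` parameters, and the stopping rule are all hypothetical choices made for illustration. What the sketch does share with the abstract is the overall shape: $k$-fold cross-validation on text-only test-set inputs, a per-sample bias score $s(x)$ taken from held-out predictions, and repeated removal of the highest-scoring samples.

```python
import random
from collections import Counter, defaultdict


def kfold_bias_scores(samples, k=5, seed=0):
    """Blind k-fold diagnostic over (question_type, answer) pairs.

    Returns (blind_accuracy, scores). scores[i] plays the role of s(x_i):
    the held-out frequency of sample i's true answer among training
    samples of the same question type. No image or video is ever read.
    ASSUMPTION: a frequency counter replaces the paper's LLM / Random Forest.
    """
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    scores = [0.0] * len(samples)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        # "Train" on the other folds: count answers per question type.
        counts = defaultdict(Counter)
        for i in idx:
            if i not in held:
                qtype, answer = samples[i]
                counts[qtype][answer] += 1
        # Score the held-out fold using only those counts.
        for i in held_out:
            qtype, answer = samples[i]
            seen = counts[qtype]
            total = sum(seen.values())
            if total == 0:
                continue  # unseen question type: score stays 0.0
            scores[i] = seen[answer] / total
            if seen.most_common(1)[0][0] == answer:
                correct += 1
    return correct / len(samples), scores


def iterative_bias_pruning(samples, batch=2, threshold=0.5, k=5, max_rounds=50):
    """Drop the `batch` highest-scoring samples per round until blind
    accuracy is at or below `threshold`, or too few samples remain.
    ASSUMPTION: batch size, threshold and stopping rule are hypothetical.
    """
    kept = list(samples)
    for _ in range(max_rounds):
        if len(kept) <= k:
            break
        accuracy, scores = kfold_bias_scores(kept, k=k)
        if accuracy <= threshold:
            break
        # Rank by bias score, highest first, and remove the top batch.
        ranked = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)
        drop = set(ranked[:batch])
        kept = [s for i, s in enumerate(kept) if i not in drop]
    return kept


# Toy data: every "count" question has the same answer, so it is fully guessable.
toy = [("count", "2")] * 10
blind_accuracy, bias_scores = kfold_bias_scores(toy)
pruned = iterative_bias_pruning(toy)
```

Recomputing the scores after every pruning round matters in this sketch because removing the most guessable samples changes the answer statistics that the remaining samples are scored against. A real audit would swap the frequency counter for a stronger text-only model, which is where the abstract's fine-tuned LLM and Random Forest come in.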

Related papers