Fugu-MT Paper Translation (Summary): The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Paper summary: The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

arxiv url: http://arxiv.org/abs/2606.03305v1
Date: Tue, 02 Jun 2026 08:21:22 GMT
Status: Translation complete
System update date: 2026-06-03 22:00:04.859928
Title: The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
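Authors: Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert
Abstract summary: Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes. The authors identify two under-studied failure modes: distribution shift and scale constraints. Across 335 evaluations, only 199 yield correct outcomes.
Reference score (independently computed attention measure): 4.921591758479804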
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms: LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then further extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.
