Fugu-MT Paper Translation (Abstract): Search-Time Data Contamination

Paper summary: Search-Time Data Contamination

arxiv url: http://arxiv.org/abs/2508.13180v1
Date: Tue, 12 Aug 2025 22:52:21 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:31.632519
Title: Search-Time Data Contamination
Authors: Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang
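Abstract summary: We identify search-time contamination (STC), a problem analogous to training-data contamination, in the evaluation of search-based LLM agents. We find that HuggingFace, an online platform hosting evaluation datasets, appears among the sources retrieved in search-based agent logs. We propose best practices for benchmark design and result reporting to address this new form of leakage.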
Reference score (independently computed attention score): 18.94571261664399
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Data contamination refers to the leakage of evaluation data into model training data, resulting in overfitting to supposedly held-out test sets and compromising test validity. We identify an analogous issue, search-time contamination (STC), in evaluating search-based LLM agents which use tools to gather information from online sources when answering user queries. STC occurs when the retrieval step surfaces a source containing the test question (or a near-duplicate) alongside its answer, enabling agents to copy rather than genuinely infer or reason, undermining benchmark integrity. We find that HuggingFace, an online platform hosting evaluation datasets, appears among retrieved sources in search-based agent logs. Consequently, agents often explicitly acknowledge discovering question-answer pairs from HuggingFace within their reasoning chains. On three commonly used capability benchmarks: Humanity's Last Exam (HLE), SimpleQA, and GPQA, we demonstrate that for approximately 3% of questions, search-based agents directly find the datasets with ground truth labels on HuggingFace. When millions of evaluation queries target the same benchmark, even small, repeated leaks can accelerate the benchmark's obsolescence, shortening its intended lifecycle. After HuggingFace is blocked, we observe a drop in accuracy on the contaminated subset of approximately 15%. We further show through ablation experiments that publicly accessible evaluation datasets on HuggingFace may not be the sole source of STC. To this end, we conclude by proposing best practices for benchmark design and result reporting to address this novel form of leakage and ensure trustworthy evaluation of search-based LLM agents. To facilitate the auditing of evaluation results, we also publicly release the complete logs from our experiments.
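The kind of log audit the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout (`question_id`, `retrieved_urls`, `correct`), the function names, and the sample data are all hypothetical, and the only signal checked is whether a retrieved URL belongs to a blocked domain such as huggingface.co. Since the abstract notes that HuggingFace may not be the sole source of STC, a domain check like this would catch at most one leakage channel.

```python
from urllib.parse import urlparse

# Hypothetical default; the abstract names HuggingFace as one observed source.
BLOCKED_DOMAINS = ("huggingface.co",)


def is_blocked_source(url, blocked_domains=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def flag_contaminated(records, blocked_domains=BLOCKED_DOMAINS):
    """Return the records whose retrieved sources include a blocked domain."""
    return [
        r for r in records
        if any(is_blocked_source(u, blocked_domains) for u in r["retrieved_urls"])
    ]


def accuracy(records):
    """Fraction of records marked correct, or None for an empty list."""
    if not records:
        return None
    return sum(1 for r in records if r["correct"]) / len(records)


# Made-up sample log entries, one per benchmark question (assumed format).
logs = [
    {"question_id": "q1", "correct": True,
     "retrieved_urls": ["https://huggingface.co/datasets/example/benchmark"]},
    {"question_id": "q2", "correct": False,
     "retrieved_urls": ["https://en.wikipedia.org/wiki/Example"]},
    {"question_id": "q3", "correct": True,
     "retrieved_urls": ["https://example.com/huggingface.co/page",
                        "https://datasets.huggingface.co/x"]},
    {"question_id": "q4", "correct": True, "retrieved_urls": []},
]

flagged = flag_contaminated(logs)
contamination_rate = len(flagged) / len(logs)

print("flagged question ids:", [r["question_id"] for r in flagged])
print("contamination rate:", contamination_rate)
print("accuracy on flagged subset:", accuracy(flagged))
```

To estimate the effect of blocking, the same questions in the flagged subset would be re-run with the domain filtered out of the search tool, and `accuracy` compared across the two runs. Matching on the parsed hostname rather than on a substring avoids flagging unrelated pages whose path merely contains the domain name.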


Related papers