Fugu-MT Paper Translation (Abstract): CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Paper overview: CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

arxiv url: http://arxiv.org/abs/2511.18889v1
Date: Mon, 24 Nov 2025 08:44:29 GMT
Status: Translation complete
Last updated in system: 2025-11-25 18:34:25.116151
Title: CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
Title (reference translation): CoreEval: Automatic Construction of Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
Authors: Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, Ruifeng Xu
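Abstract summary: Data contamination is a significant challenge to the fairness of LLM evaluation in natural language processing tasks. We propose CoreEval, a strategy for automatically updating data with real-world knowledge.
Reference score (independently computed attention score): 38.14943360647566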
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
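Abstract (reference translation): Data contamination poses a significant challenge to the fairness of LLM evaluation in natural language processing tasks by inadvertently exposing models to test data during training. Current studies try to mitigate this problem by modifying existing datasets or by generating new datasets from newly collected information. However, these methods are insufficient to guarantee contamination-resilient evaluation, because they cannot fully eliminate existing knowledge from models or maintain the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. The approach begins by extracting entity relationships from the original data and using the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and improved task relevance. Finally, a robust data reflection mechanism iteratively verifies and refines labels, ensuring consistency between the updated and original datasets. Extensive experiments on the updated datasets validate the robustness of CoreEval and demonstrate its effectiveness in mitigating the performance overestimation caused by data contamination.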
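
The four stages named in the abstract (entity-relation extraction, knowledge retrieval, recontextualization, and label reflection) can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: `Example`, `update_example`, `max_rounds`, and all `toy_*` functions are hypothetical names, each stage is passed in as a placeholder callable rather than a real LLM or GDELT call, and the retry-then-fall-back policy is an assumption about how an iterative verification step might behave.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (head entity, relation, tail entity); a hypothetical representation
Triple = Tuple[str, str, str]


@dataclass
class Example:
    text: str
    label: str


def update_example(
    original: Example,
    extract_triples: Callable[[str], List[Triple]],
    retrieve_knowledge: Callable[[List[Triple]], List[str]],
    rewrite: Callable[[Example, List[str]], Example],
    verify_label: Callable[[Example], str],
    max_rounds: int = 3,
) -> Example:
    """Run the four stages once, then retry the rewrite until the label is confirmed.

    Every callable is a stand-in: in practice extraction, rewriting, and
    verification would likely be model calls, and retrieval a news-database query.
    """
    triples = extract_triples(original.text)
    knowledge = retrieve_knowledge(triples)
    candidate = rewrite(original, knowledge)
    for _ in range(max_rounds):
        # Reflection step (assumed form): accept only if the verifier
        # agrees with the label carried over from the original example.
        if verify_label(candidate) == candidate.label:
            return candidate
        candidate = rewrite(original, knowledge)
    # Assumed fallback: keep the original example if no rewrite is consistent.
    return original


# Toy stand-ins so the sketch runs end to end.
def toy_extract(text: str) -> List[Triple]:
    return [("Alice", "visited", "Paris")] if "Alice" in text else []


def toy_retrieve(triples: List[Triple]) -> List[str]:
    return [f"Recent report mentioning {h} and {t}" for h, _, t in triples]


def toy_rewrite(ex: Example, knowledge: List[str]) -> Example:
    return Example(text=" ".join(knowledge) or ex.text, label=ex.label)


def toy_verify(ex: Example) -> str:
    return "positive"
```

Passing the stages as callables keeps the control flow separate from any particular model or data source, so the loop can be exercised with stubs like the ones above before real components are plugged in.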
