Fugu-MT Paper Translation (Abstract): DataDignity: Training Data Attribution for Large Language Models

Paper overview: DataDignity: Training Data Attribution for Large Language Models

arxiv url: http://arxiv.org/abs/2605.05687v1
Date: Thu, 07 May 2026 05:27:45 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.529989
Title: DataDignity: Training Data Attribution for Large Language Models
Title (reference translation): DataDignity: Training Data Attribution for Large Language Models
Authors: Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier
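Abstract summary: We introduce FakeWiki, a benchmark of 3,537 Wikipedia-style articles. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, and hard anti-documents that remain topically similar while removing answer-critical facts. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method (SteerFuse), and a supervised contrastive provenance ranker (ScoringModel).
Reference score (attention score computed independently by this site): 8.195274857647782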
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
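The headline numbers in the abstract are mean Recall@10 values. As a minimal sketch of how such a metric is commonly computed, the snippet below averages per-query Recall@k and reports it as a percentage. The function names, the dictionary layout, and the percentage scaling are assumptions made for illustration; the abstract does not describe the authors' evaluation code.

```python
from typing import Dict, Sequence


def recall_at_k(ranked_doc_ids: Sequence[str],
                gold_doc_ids: Sequence[str],
                k: int = 10) -> float:
    """Fraction of gold (supporting) documents found in the top-k ranking."""
    gold = set(gold_doc_ids)
    if not gold:
        raise ValueError("each query needs at least one gold document")
    top_k = set(ranked_doc_ids[:k])
    return len(gold & top_k) / len(gold)


def mean_recall_at_k(rankings: Dict[str, Sequence[str]],
                     gold: Dict[str, Sequence[str]],
                     k: int = 10) -> float:
    """Average Recall@k over all queries, scaled to a percentage (assumed)."""
    if not gold:
        raise ValueError("no queries to evaluate")
    scores = [recall_at_k(rankings[query_id], gold[query_id], k)
              for query_id in gold]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two responses, each with one gold source document.
    rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d2", "d4"]}
    gold = {"q1": ["d1"], "q2": ["d5"]}
    print(mean_recall_at_k(rankings, gold, k=2))
```

In the pinpoint-provenance setting each query would be a prompt and response pair, and the gold set would hold the FakeWiki article that actually supports the response.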
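Abstract (reference translation): An auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. FakeWiki is a controlled benchmark of 3,537 Wikipedia-style articles, designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions comprising clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method (SteerFuse), and a supervised contrastive provenance ranker (ScoringModel). ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.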
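The abstract states that ScoringModel is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. The following is a minimal, dependency-free sketch of an InfoNCE objective over already-computed embeddings. It is not the authors' implementation: the function names, the cosine-similarity scoring, the temperature of 0.07, and the `extra_negatives` parameter (standing in for mined or anti-document negatives) are all assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce_loss(response_vecs: Sequence[Vector],
                  document_vecs: Sequence[Vector],
                  extra_negatives: Sequence[Vector] = (),
                  temperature: float = 0.07) -> float:
    """Mean InfoNCE loss; row i of each input is a matched (positive) pair.

    Every other document in the batch acts as an in-batch negative.
    extra_negatives (hypothetical) are appended as shared hard negatives.
    """
    if not response_vecs or len(response_vecs) != len(document_vecs):
        raise ValueError("need equal, non-zero numbers of responses and documents")
    responses = [l2_normalize(r) for r in response_vecs]
    candidates = [l2_normalize(d) for d in document_vecs]
    candidates += [l2_normalize(n) for n in extra_negatives]

    total = 0.0
    for i, response in enumerate(responses):
        # Cosine similarity to every candidate, sharpened by the temperature.
        logits = [dot(response, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits))
        # Negative log-probability assigned to the matching document.
        total += log_denominator - logits[i]
    return total / len(responses)


if __name__ == "__main__":
    responses = [[1.0, 0.0], [0.0, 1.0]]
    aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
    swapped_docs = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce_loss(responses, aligned_docs))   # small: pairs match
    print(info_nce_loss(responses, swapped_docs))   # large: pairs mismatched
```

Passing a near-duplicate of a positive document through `extra_negatives` raises the loss unless the encoder separates the two, which is the intuition behind using hard anti-documents as negatives.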
