Fugu-MT Paper Translation (Abstract): Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

Paper abstract: Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

arxiv url: http://arxiv.org/abs/2605.25902v2
Date: Tue, 02 Jun 2026 12:10:15 GMT
Status: Translation complete
Last updated in system: 2026-06-03 18:57:50.060779
Title: Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing
Authors: Michał Brzozowski, Zuzanna Dubanowska, Enrico Cassano, Neo Christopher Chung
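Abstract summary: Contrastive Decoding Diffing (CDD) is a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning. A single default configuration recovers implanted facts verbatim across four architectures. The authors validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting.
Reference score (attention score computed independently by the site): 1.9599274203282298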
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full "white-box" access to model internals. We introduce Contrastive Decoding Diffing (CDD), a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning, yet recovers implanted facts. CDD consists of three ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step. A single default configuration recovers implanted facts verbatim -- exact drug names, vote counts, physical measurements, and procedural details -- across four architectures (1B--32B parameters), uniformly outperforming ADL despite less access and running ~170x faster. Furthermore, CDD surfaces unintended data pipeline artifacts: a fictional persona introduced by the LLM data generator via mode collapse leaked into model weights and was extracted by CDD, constituting to our knowledge the first demonstrated end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. CDD's success as a grey-box method outperforming white-box baselines underscores its practical utility for transparency and accountability in AI systems.
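The third idea in the abstract, amplifying the logit-space difference between the finetuned and base models at each decoding step, can be illustrated with a small toy sketch. The abstract does not give the exact formula, so the scoring rule below (finetuned log-probabilities plus `alpha` times the finetuned-minus-base gap) is an assumption in the style of contrastive decoding, not the authors' confirmed method. The function names, the `alpha` parameter, the three-token vocabulary, and the logit values are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def amplified_scores(ft_logits, base_logits, alpha=1.0):
    """Assumed rule: ft log-probs plus alpha times the (ft - base) log-prob gap.

    alpha = 0 reduces to ordinary decoding from the finetuned model.
    """
    ft = log_softmax(ft_logits)
    base = log_softmax(base_logits)
    return [f + alpha * (f - b) for f, b in zip(ft, base)]


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary and made-up logits for a single decoding step.
# "zorvex" stands in for an implanted token the base model finds unlikely.
vocab = ["the", "a", "zorvex"]
base_logits = [3.0, 2.5, -2.0]
ft_logits = [3.0, 2.4, 2.0]

plain = vocab[greedy_pick(ft_logits)]
contrastive = vocab[greedy_pick(amplified_scores(ft_logits, base_logits, alpha=1.0))]
print(plain, contrastive)
```

In this toy step, greedy decoding from the finetuned logits alone still prefers a generic token, while the amplified score favors the token whose probability rose most under finetuning. A real setup would obtain both logit vectors from the two models on the same prefix at every step.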

Related papers