Fugu-MT paper translation (abstract): Asking Back: Interaction-Layer Antidistillation Watermarks

Paper abstract: Asking Back: Interaction-Layer Antidistillation Watermarks

arxiv url: http://arxiv.org/abs/2605.16462v1
Date: Fri, 15 May 2026 08:28:35 GMT
Status: translation complete
Last updated in system: 2026-05-19 17:57:46.494212
Title: Asking Back: Interaction-Layer Antidistillation Watermarks
Title (reference translation): Asking Back: Interaction-Layer Antidistillation Watermarks
Authors: Guang Yang, Amir Ghasemian, Fengchen Liu, Zhong Wang, Ninareh Mehrabi, Homa Hosseinmardi
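Abstract summary: Existing defenses operate on the teacher's output tokens. Recent work shows that a paraphrasing attacker can strip these signals without losing the underlying knowledge. We propose interaction-layer antidistillation watermarks.
Reference score (independently computed attention score): 7.826668598190874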
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Detecting unauthorized knowledge distillation from a deployed LLM API is hard because the defender controls neither the attacker's training pipeline nor the next-token logits. Existing defenses operate on the teacher's output tokens -- biasing the next-token distribution (green-list watermarks, cryptographic schemes, antidistillation sampling) or rewriting outputs after generation. Recent work shows a paraphrasing attacker can strip these signals without losing the underlying knowledge. We propose interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher's interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker -- an explicit follow-up question, a low-frequency variant, or a declarative restatement. An oblivious distiller inherits the behavior, and the defender audits via black-box queries with a human-validated LLM-as-judge (Cohen's kappa = 0.84/0.78 on strong/style rubrics). Across 63 LoRA-distilled students under a Llama-3.3-70B-Instruct teacher (35,343 judged samples), behavioral watermarks transfer at 88.9% (Gemma) / 80.9% (OLMo) / 45.2% (Qwen) relative fidelity (H1, H2). Under non-adaptive DIPPER paraphrasing, robustness decomposes into a teacher-self ceiling (about 66.4%) and student-relative retention of 21-112%, with OLMo preserving the watermark above the teacher itself (H3, F-Amp). Low-density (about 20%) explicit and implicit declarative variants transfer above per-family baseline (H4, F-Style). An N=20 in-lab study (pre-registered Latin-square) shows all marker variants within 0.22 Likert step of baseline; TOST, Friedman, and Bonferroni-Wilcoxon support H5. The interaction layer is a viable design locus for antidistillation watermarking, complementary to token-, model-, and reasoning-trace-layer defenses.
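Abstract (reference translation): Detecting unauthorized knowledge distillation from a deployed LLM API is hard because the defender controls neither the attacker's training pipeline nor the next-token logits. Existing defenses operate on the teacher's output tokens, either biasing the next-token distribution (green-list watermarks, cryptographic schemes, antidistillation sampling) or rewriting outputs after generation. Recent work shows that a paraphrasing attacker can strip these signals without losing the underlying knowledge. This paper proposes interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher's interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker (an explicit follow-up question, a low-frequency variant, or a declarative restatement). An oblivious distiller inherits the behavior, and the defender audits via black-box queries with a human-validated LLM-as-judge (Cohen's kappa = 0.84/0.78 on strong/style rubrics). Across 63 LoRA-distilled students under a Llama-3.3-70B-Instruct teacher (35,343 judged samples), behavioral watermarks transfer at 88.9% (Gemma) / 80.9% (OLMo) / 45.2% (Qwen) relative fidelity (H1, H2). Under non-adaptive DIPPER paraphrasing, robustness decomposes into a teacher-self ceiling (about 66.4%) and student-relative retention of 21-112%, with OLMo preserving the watermark above the teacher itself (H3, F-Amp). Low-density (about 20%) explicit and implicit declarative variants transfer above the per-family baseline (H4, F-Style). An N=20 in-lab study (pre-registered Latin-square) shows all marker variants within 0.22 Likert step of baseline; TOST, Friedman, and Bonferroni-Wilcoxon tests support H5. The interaction layer is a viable design locus for antidistillation watermarking, complementary to token-, model-, and reasoning-trace-layer defenses.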
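
Note: the audit quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract does not define "relative fidelity", so the ratio used here (student marker rate divided by teacher marker rate) is an assumption, as are all function names and the binary judge labels (1 = marker present). Cohen's kappa follows the standard two-rater formula and is not taken from the paper's code.

```python
from collections import Counter
from typing import Sequence


def marker_rate(labels: Sequence[int]) -> float:
    """Fraction of judged responses flagged as containing the marker (1 = present)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def relative_fidelity(student: Sequence[int], teacher: Sequence[int]) -> float:
    """Student marker rate as a fraction of the teacher's rate.

    Assumed definition: the abstract reports relative fidelity but
    does not spell out how it is computed.
    """
    teacher_rate = marker_rate(teacher)
    if teacher_rate == 0:
        raise ValueError("teacher marker rate is zero; ratio undefined")
    return marker_rate(student) / teacher_rate


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Standard Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Under these assumptions, a defender would compare judge labels on the suspect model's responses against labels on the wrapped teacher's responses, and would separately check the judge against human annotations with `cohens_kappa` before trusting its labels.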

Related papers