Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
- URL: http://arxiv.org/abs/2602.15143v1
- Date: Mon, 16 Feb 2026 19:40:07 GMT
- Title: Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
- Authors: Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik
- Abstract summary: Unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance.
- Score: 31.05181251141126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) \emph{anti-distillation}, or degrading the training usefulness of query responses, and (2) \emph{API watermarking}, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
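The abstract describes the serving-time mechanism only at a high level: the teacher answers a query, then its reasoning trace is rewritten before it leaves the API. The sketch below illustrates one plausible shape of the instruction-based variant; the prompt wording, the `generate` callable, and the `serve_with_rewriting` function are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of instruction-based trace rewriting at serving time,
# assuming a generic `generate` callable that wraps the teacher LLM.
from typing import Callable

REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below so the final answer and its justification "
    "remain correct and readable, but restructure the intermediate steps "
    "(ordering, phrasing, level of detail) so the text is a poor imitation "
    "target for training another model. Keep the final answer unchanged."
)

def serve_with_rewriting(
    query: str,
    generate: Callable[[str], str],  # hypothetical wrapper around the teacher LLM
) -> str:
    """Answer a query, then rewrite the reasoning trace before returning it."""
    # 1. Produce the original reasoning trace and answer.
    raw_trace = generate(query)
    # 2. Ask the same model to rewrite its own trace under the
    #    anti-distillation instruction, preserving answer correctness
    #    and semantic coherence.
    rewrite_prompt = f"{REWRITE_INSTRUCTION}\n\n---\n{raw_trace}"
    rewritten_trace = generate(rewrite_prompt)
    # 3. Only the rewritten trace is exposed through the API.
    return rewritten_trace

if __name__ == "__main__":
    # Toy stand-in for a teacher model so the sketch runs end to end.
    def toy_teacher(prompt: str) -> str:
        return "Step 1 ... Step 2 ... Final answer: 42"

    print(serve_with_rewriting("What is 6 * 7?", toy_teacher))
```

In this reading, the cost of the defense is one extra generation call per query, and the rewrite instruction is the main knob trading anti-distillation strength against readability of the returned trace.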
Related papers
- DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher [5.406594712642111]
Distilled Unlearning from an Efficient Teacher (DUET) is a novel distillation-based unlearning method. It achieves higher performance in both forgetting and utility preservation, while being orders of magnitude more data-efficient than state-of-the-art unlearning methods.
arXiv Detail & Related papers (2026-01-29T05:32:35Z) - AdaSwitch: Adaptive Switching Generation for Knowledge Distillation [58.647880811071495]
Small language models (SLMs) are crucial for applications with strict latency and computational constraints. We propose AdaSwitch, a novel approach that combines on-policy and off-policy generation at the token level. AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
arXiv Detail & Related papers (2025-10-09T06:38:37Z) - DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation [49.58082402742583]
Large Language Models (LLMs) represent substantial intellectual and economic investments. LLMs can inadvertently facilitate model imitation via knowledge distillation (KD). This paper introduces an effective and efficient Defensive Output Generation (DOGe) strategy.
arXiv Detail & Related papers (2025-05-26T04:31:38Z) - Unified attacks to large language model watermarks: spoofing and scrubbing in unauthorized knowledge distillation [33.394877468499395]
We propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD) as a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs from the student model and weakly watermarked references (a minimal sketch of this contrastive-decoding step appears after this list). Our findings underscore the critical need for developing watermarking schemes that are robust and unforgeable.
arXiv Detail & Related papers (2025-04-24T12:15:46Z) - UNDO: Understanding Distillation as Optimization [9.100811514331498]
We introduce UNDO, the UNderstanding Distillation as Optimization framework. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods.
arXiv Detail & Related papers (2025-04-03T12:18:51Z) - Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching [16.385782508179364]
Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. Current distillation methods require similar tokenizers between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. We develop a principled cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable effective distillation across fundamentally different tokenizers, while also substantially outperforming prior methods in all other cases.
arXiv Detail & Related papers (2025-03-25T21:44:10Z) - Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? [75.99961894619986]
This paper investigates whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN).
arXiv Detail & Related papers (2025-02-17T09:34:19Z) - Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions. Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Distilling Image Classifiers in Object Detectors [81.63849985128527]
We study the case of object detection and, instead of following the standard detector-to-detector distillation approach, introduce a classifier-to-detector knowledge transfer framework.
In particular, we propose strategies to exploit the classification teacher to improve both the detector's recognition accuracy and localization performance.
arXiv Detail & Related papers (2021-06-09T16:50:10Z) - Autoregressive Knowledge Distillation through Imitation Learning [70.12862707908769]
We develop a compression technique for autoregressive models driven by an imitation learning perspective on knowledge distillation.
Our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Student models trained with our method attain scores 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times compared to the teacher model.
arXiv Detail & Related papers (2020-09-15T17:43:02Z)
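The CDG-KD entry above mentions contrastive decoding between a distilled student and a weakly watermarked reference, used to either amplify a watermark (spoofing) or suppress it (scrubbing). The toy sketch below applies the standard contrastive-decoding form to illustrate that idea; the interpolation formula, the `alpha` coefficient, and the toy logits are assumptions based on the general technique, not the paper's exact method.

```python
# Toy illustration of contrastive decoding between a student model and a
# weakly watermarked reference, as a stand-in for the CDG-KD step above.
import numpy as np

def contrastive_logits(student_logits: np.ndarray,
                       reference_logits: np.ndarray,
                       alpha: float) -> np.ndarray:
    """Shift student logits along the student-minus-reference direction.

    alpha > 0 amplifies what the student has that the reference lacks
    (e.g. a watermark bias toward certain tokens); alpha < 0 pushes the
    distribution back toward the reference (e.g. suppressing a watermark).
    """
    return student_logits + alpha * (student_logits - reference_logits)

if __name__ == "__main__":
    vocab = ["the", "a", "green", "blue"]        # toy vocabulary
    student = np.array([2.0, 1.0, 0.5, 3.0])     # toy student scores
    reference = np.array([2.0, 1.0, 0.5, 1.0])   # weakly watermarked reference
    for a in (-1.0, 0.0, 1.0):
        probs = np.exp(contrastive_logits(student, reference, a))
        probs /= probs.sum()
        print(f"alpha={a:+.1f}", dict(zip(vocab, probs.round(3))))
```

Under this assumed form, the sign of `alpha` selects between watermark amplification and watermark scrubbing on the same student-reference pair.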