Detecting Distillation Data from Reasoning Models
- URL: http://arxiv.org/abs/2510.04850v2
- Date: Wed, 15 Oct 2025 08:23:27 GMT
- Title: Detecting Distillation Data from Reasoning Models
- Authors: Hengxiang Zhang, Hyeong Kyu Choi, Sharon Li, Hongxin Wei
- Abstract summary: Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. We propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens.
- Score: 24.64119860277633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. Our key idea behind TBD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.
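The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hinge-style shortfall, the reference value of 0.9, the function names `deviation_score` and `pairwise_auc`, and the example probabilities are all assumptions made for illustration. The sketch only shows how "deviation from a high reference probability" can yield lower scores for seen questions than for unseen ones, and how an AUC can be computed from such scores.

```python
from typing import Sequence


def deviation_score(token_probs: Sequence[float], reference: float = 0.9) -> float:
    """Mean shortfall of generated-token probabilities below `reference`.

    Assumed form (not taken from the paper): tokens at or above the
    reference contribute 0, lower-probability tokens contribute their gap.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return sum(max(0.0, reference - p) for p in token_probs) / len(token_probs)


def pairwise_auc(seen_scores: Sequence[float], unseen_scores: Sequence[float]) -> float:
    """Fraction of (seen, unseen) pairs where the seen score is lower.

    Ties count as half a win. Lower scores indicate "seen", matching the
    abstract's description.
    """
    wins = 0.0
    for s in seen_scores:
        for u in unseen_scores:
            if s < u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(seen_scores) * len(unseen_scores))


# Hypothetical per-token probabilities of the model's own generated output.
seen_question = [0.99, 0.98, 1.00, 0.97]    # near-deterministic generation
unseen_question = [0.95, 0.40, 0.85, 0.30]  # several low-probability tokens

seen_score = deviation_score(seen_question)
unseen_score = deviation_score(unseen_question)
```

In this toy example the seen question scores lower than the unseen one, so thresholding the score would flag it as likely distillation data. In practice the token probabilities would come from the distilled model's generation on each question.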
Related papers
- Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective [52.25797439810419]
Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. We derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility.
arXiv Detail & Related papers (2026-02-03T11:16:59Z)
- Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks [55.27114962330541]
We propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task. Deep neural networks achieve remarkable performance but have time and storage-consuming training processes.
arXiv Detail & Related papers (2026-01-15T05:29:50Z)
- MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models [50.2406741245418]
We propose a mode-guided diffusion model leveraging a pre-trained diffusion model. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs.
arXiv Detail & Related papers (2025-05-25T03:40:23Z)
- Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation [82.39763984380625]
We introduce denoising score distillation (DSD), a surprisingly effective and novel approach for training high-quality generative models from low-quality data. DSD pretrains a diffusion model exclusively on noisy, corrupted samples and then distills it into a one-step generator capable of producing refined, clean outputs.
arXiv Detail & Related papers (2025-03-10T17:44:46Z)
- DSDE: Using Proportion Estimation to Improve Model Selection for Out-of-Distribution Detection [15.238164468992148]
Experimental results on CIFAR10 and CIFAR100 demonstrate the effectiveness of our approach in tackling OoD detection challenges.
We name the proposed approach DOS-Storey-based Detector Ensemble (DSDE).
arXiv Detail & Related papers (2024-11-03T09:01:36Z)
- Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification [0.0]
The main goal is to further improve the classification accuracy of prototype-based soft-labels distillation.
Experimental studies trace the capability of the method to distill the data, but also the opportunity to act as an augmentation method.
arXiv Detail & Related papers (2024-03-25T19:15:19Z)
- Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We find the most contributing samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z)
- Explicit and Implicit Knowledge Distillation via Unlabeled Data [5.702176304876537]
We propose an efficient unlabeled sample selection method to replace high computational generators.
We also propose a class-dropping mechanism to suppress the label noise caused by the data domain shifts.
Experimental results show that our method can quickly converge and obtain higher accuracy than other state-of-the-art methods.
arXiv Detail & Related papers (2023-02-17T09:10:41Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
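The thresholding idea behind ATC, as summarized in the last entry, can be sketched as follows. This is a simplified illustration and not the paper's code. It assumes the threshold is fitted on labeled source data so that the fraction of source examples above it matches source accuracy. The function names `fit_threshold` and `predict_target_accuracy` and the confidence values are hypothetical.

```python
from typing import Sequence


def fit_threshold(source_conf: Sequence[float], source_correct: Sequence[bool]) -> float:
    """Choose a threshold so that the number of source examples whose
    confidence exceeds it equals the number of correct source predictions."""
    if not source_conf or len(source_conf) != len(source_correct):
        raise ValueError("need equally sized, non-empty inputs")
    n = len(source_conf)
    k = sum(1 for c in source_correct if c)  # examples that should lie above
    ranked = sorted(source_conf, reverse=True)
    if k == 0:
        return ranked[0]          # nothing is strictly above the maximum
    if k == n:
        return float("-inf")      # everything is above
    # Midpoint between the k-th and (k+1)-th highest confidence; with tied
    # confidences the match to source accuracy is only approximate.
    return (ranked[k - 1] + ranked[k]) / 2


def predict_target_accuracy(target_conf: Sequence[float], threshold: float) -> float:
    """Predicted accuracy: fraction of unlabeled examples above the threshold."""
    if not target_conf:
        raise ValueError("target_conf must be non-empty")
    return sum(1 for c in target_conf if c > threshold) / len(target_conf)


# Hypothetical confidences: labeled source data and unlabeled target data.
threshold = fit_threshold([0.9, 0.8, 0.6, 0.4], [True, True, False, True])
estimate = predict_target_accuracy([0.95, 0.45, 0.7, 0.2], threshold)
```

Here three of four source predictions are correct, so the threshold is placed to leave three source confidences above it. The target estimate is then the share of target confidences exceeding that value.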