Towards Undistillable Models by Minimizing Conditional Mutual Information
- URL: http://arxiv.org/abs/2507.00012v1
- Date: Fri, 13 Jun 2025 00:56:29 GMT
- Title: Towards Undistillable Models by Minimizing Conditional Mutual Information
- Authors: Linfeng Ye, Shayan Mohajer Hamidi, En-hui Yang
- Abstract summary: A deep neural network (DNN) is said to be undistillable if, when used as a black-box input-output teacher, it cannot be distilled through knowledge distillation (KD). A new training method called the CMI minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss and the CMI values of all temperature-scaled clusters. The CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods in the literature.
- Score: 3.4398508628750313
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A deep neural network (DNN) is said to be undistillable if, when used as a black-box input-output teacher, it cannot be distilled through knowledge distillation (KD). In this case, the distilled student (referred to as the knockoff student) does not outperform a student trained independently with label smoothing (LS student) in terms of prediction accuracy. To protect the intellectual property of DNNs, it is desirable to build undistillable DNNs. To this end, it is first observed that an undistillable DNN may have the trait that each cluster of its output probability distributions, in response to all sample instances with the same label, is highly concentrated, to the extent that each cluster ideally collapses into a single probability distribution per label. Based on this observation, and by measuring the concentration of each cluster in terms of conditional mutual information (CMI), a new training method called the CMI minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss and the CMI values of all temperature-scaled clusters across the entire temperature spectrum. The resulting CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods in the literature. That is, the knockoff students distilled by these KD methods from the CMIM model underperform the respective LS students. In addition, the CMIM model is shown to perform better, in terms of its own prediction accuracy, than the same model trained with the CE loss alone.
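The PyTorch sketch below illustrates one plausible form of this joint objective; it is not the authors' implementation. `empirical_cmi` estimates the CMI of one temperature-scaled cluster set as the mean KL divergence between each sample's output distribution and the centroid distribution of its label, and the weight `lam` and the small temperature grid are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def empirical_cmi(logits, labels, temperature, num_classes):
    """Empirical CMI at one temperature: mean KL divergence between each
    sample's temperature-scaled output distribution and the centroid
    distribution of its label's cluster (an assumed estimator)."""
    p = F.softmax(logits / temperature, dim=1)  # (N, C) sample distributions
    # Per-class centroids: average distribution of all samples sharing a label.
    centroids = torch.zeros(num_classes, p.size(1), device=p.device)
    centroids.index_add_(0, labels, p)
    counts = torch.bincount(labels, minlength=num_classes).clamp(min=1)
    centroids = centroids / counts.unsqueeze(1)
    q = centroids[labels]                        # (N, C) matched centroids
    # KL(p || q), averaged over the batch.
    return (p * (p.clamp_min(1e-12).log() - q.clamp_min(1e-12).log())).sum(1).mean()

def cmim_loss(logits, labels, num_classes, temperatures=(1.0, 2.0, 4.0), lam=0.1):
    """Hypothetical CMIM-style objective: cross entropy plus the CMI of the
    temperature-scaled clusters summed over a small temperature grid."""
    ce = F.cross_entropy(logits, labels)
    cmi = sum(empirical_cmi(logits, labels, t, num_classes) for t in temperatures)
    return ce + lam * cmi
```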
Related papers
- Distillation of Discrete Diffusion by Exact Conditional Distribution Matching [9.460409527892345]
We propose a simple and principled distillation alternative based on conditional distribution matching.
We exploit this structure to define distillation objectives that directly match conditional distributions between a pre-trained teacher and a low-NFE student.
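The snippet below is only a generic sketch of distillation by conditional distribution matching, assuming per-token categorical conditionals over a vocabulary; the paper's exact objective for discrete diffusion may differ.

```python
import torch
import torch.nn.functional as F

def cdm_distill_loss(student_logits, teacher_logits):
    """Generic conditional-distribution-matching loss: per-token cross entropy
    between the teacher's conditional distribution and the student's.
    Shapes: (batch, seq_len, vocab)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    log_student = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_student).sum(-1).mean()
```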
arXiv Detail & Related papers (2025-12-15T00:16:10Z) - GNN's Uncertainty Quantification using Self-Distillation [0.6906005491572398]
We propose a novel method, based on knowledge distillation, to quantify Graph Neural Networks' uncertainty more efficiently and with higher precision.
We experimentally evaluate the precision, performance, and ability of our approach in distinguishing out-of-distribution data on two graph datasets.
arXiv Detail & Related papers (2025-06-24T23:08:31Z) - Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation [53.30082523545212]
Knowledge distillation (KD) is a core component in the training and deployment of modern generative models.
We show that KD induces a trade-off between precision and recall in the student model.
Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
arXiv Detail & Related papers (2025-05-19T13:39:47Z) - CEC-MMR: Cross-Entropy Clustering Approach to Multi-Modal Regression [8.127496643086701]
We introduce CEC-MMR, which allows for the automatic detection of the number of components in a regression problem.
Given an attribute and its value, our method is capable of uniquely identifying it with the underlying component.
Results demonstrate that CEC-MMR yields superior outcomes compared to classical mixture density networks (MDNs).
arXiv Detail & Related papers (2025-04-09T21:51:38Z) - Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport [46.91791643660991]
Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments.
These models struggle in the wild because the modalities used for training may be unavailable or of low quality at test time.
In practice, only a subset of the training-time modalities may be available at test time.
Learning with privileged information enables models to exploit data from additional modalities that are only available during training.
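A generic privileged-information distillation step is sketched below; the paper itself aligns teacher and student with optimal transport, whereas this illustration uses plain temperature-scaled KL, and `tau` and `beta` are assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def privileged_kd_loss(student_logits, teacher_logits, labels, tau=2.0, beta=0.5):
    """A teacher trained with all modalities supervises a student that only
    sees the modalities available at test time: distillation KL on softened
    outputs plus ordinary cross entropy on the labels."""
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1).detach(),
        reduction="batchmean",
    ) * tau * tau
    return beta * kd + (1 - beta) * F.cross_entropy(student_logits, labels)
```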
arXiv Detail & Related papers (2024-01-27T19:44:15Z) - STEM Rebalance: A Novel Approach for Tackling Imbalanced Datasets using SMOTE, Edited Nearest Neighbour, and Mixup [0.20482269513546458]
Imbalanced datasets in medical imaging are characterized by skewed class proportions and scarcity of abnormal cases.
This paper investigates the potential of using Mixup augmentation to generate new data points from a generic vicinal distribution.
We focus on the breast cancer problem, where imbalanced datasets are prevalent.
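For reference, standard mixup, which this paper combines with SMOTE and Edited Nearest Neighbour, is only a few lines; `alpha` is the usual Beta-distribution hyperparameter, and pairing minority- with majority-class samples is one way to rebalance.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Standard mixup: convexly combine two samples and their one-hot labels,
    producing a synthetic point from the vicinal distribution between them."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```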
arXiv Detail & Related papers (2023-11-13T17:45:28Z) - Conditional Mutual Information Constrained Deep Learning for Classification [3.5237980787861964]
Conditional mutual information (CMI) and normalized conditional mutual information (NCMI) are introduced to measure the concentration and performance of a classification deep neural network (DNN).
By using NCMI to evaluate popular DNNs pretrained on ImageNet in the literature, it is shown that their accuracies on the ImageNet validation set are more or less inversely proportional to their NCMI values.
A novel alternating learning algorithm is proposed to solve such a constrained optimization problem.
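As a companion to the CMIM sketch above, the hedged snippet below evaluates the empirical CMI of a pretrained classifier over a validation loader, reusing the hypothetical `empirical_cmi` helper; the NCMI normalizer from the paper is not reproduced here.

```python
import torch

@torch.no_grad()
def model_cmi(model, loader, num_classes, temperature=1.0, device="cpu"):
    """Evaluate a pretrained classifier's empirical CMI over a dataset; lower
    values indicate more concentrated per-label output clusters."""
    logits, labels = [], []
    for x, y in loader:
        logits.append(model(x.to(device)).cpu())
        labels.append(y)
    return empirical_cmi(torch.cat(logits), torch.cat(labels), temperature, num_classes)
```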
arXiv Detail & Related papers (2023-09-17T01:16:45Z) - On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
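A hedged sketch of one on-policy step in the spirit of GKD follows, assuming HuggingFace-style causal LMs that accept token-id tensors; GKD itself studies several divergences (forward/reverse KL, generalized JSD), and forward KL is used here purely for illustration.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompts, max_new_tokens=64):
    """The student generates its own continuations, then is trained to match
    the teacher's token distributions on those self-generated sequences.
    For simplicity the loss covers all tokens; masking the prompt is omitted."""
    with torch.no_grad():
        seqs = student.generate(prompts, max_new_tokens=max_new_tokens, do_sample=True)
        teacher_logits = teacher(seqs).logits
    student_logits = student(seqs).logits
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    # Forward KL(teacher || student), summed over vocab, averaged over tokens.
    return (t_logp.exp() * (t_logp - s_logp)).sum(-1).mean()
```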
arXiv Detail & Related papers (2023-06-23T17:56:26Z) - Graph Neural Networks for Temperature-Dependent Activity Coefficient Prediction of Solutes in Ionic Liquids [58.720142291102135]
We present a GNN to predict temperature-dependent infinite dilution ACs of solutes in ILs.
We train the GNN on a database including more than 40,000 AC values and compare it to a state-of-the-art MCM.
The GNN and MCM achieve similar high prediction performance, with the GNN additionally enabling high-quality predictions for ACs of solutions that contain ILs and solutes not considered during training.
arXiv Detail & Related papers (2022-06-23T15:27:29Z) - Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize the k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space.
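The classic k-NN density estimator the summary refers to can be sketched as follows (SciPy-based); how OSAKD converts the resulting densities into soft targets is not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def knn_density(features, k=5):
    """Classic k-NN density estimate in a d-dimensional feature space:
    p(x) ~ k / (N * V_d(r_k)), where r_k is the distance to the k-th
    neighbour and V_d(r) is the volume of the d-ball of radius r."""
    n, d = features.shape
    tree = cKDTree(features)
    # Query k+1 neighbours because each point's nearest neighbour is itself.
    r_k = tree.query(features, k=k + 1)[0][:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) + d * np.log(r_k)
    return np.exp(np.log(k) - np.log(n) - log_vd)
```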
arXiv Detail & Related papers (2021-08-26T14:01:04Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion (CMI), where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
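Instance discrimination is typically implemented with an InfoNCE-style contrastive loss; the sketch below is a generic version over two augmented views of a batch of generated images, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(feat_a, feat_b, tau=0.1):
    """InfoNCE over two views of the same batch: each sample's two views
    should agree while differing from every other sample, which rewards
    diverse synthesis in data-free distillation."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / tau                     # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)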
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Continual Learning with Fully Probabilistic Models [70.3497683558609]
We present an approach for continual learning based on fully probabilistic (or generative) models of machine learning.
We propose a pseudo-rehearsal approach using a Gaussian Mixture Model (GMM) instance for both generator and classifier functionalities.
We show that this Gaussian Mixture Replay (GMR) approach achieves state-of-the-art performance on common class-incremental learning problems at very competitive time and memory complexity.
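A generic pseudo-rehearsal sketch is given below (scikit-learn based); it is not the paper's exact GMR model, which uses a single GMM for both generation and classification, and the per-class decomposition here is a simplifying assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_pseudo_rehearsal(old_x, old_y, n_per_class=100, n_components=4):
    """Fit one small GMM per previously seen class and sample synthetic
    'rehearsal' points from it, so old classes can be replayed during
    retraining without storing their data."""
    xs, ys = [], []
    for c in np.unique(old_y):
        gmm = GaussianMixture(n_components=n_components).fit(old_x[old_y == c])
        samples, _ = gmm.sample(n_samples=n_per_class)
        xs.append(samples)
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)
```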
arXiv Detail & Related papers (2021-04-19T12:26:26Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - One Versus all for deep Neural Network Incertitude (OVNNI) quantification [12.734278426543332]
We propose a new technique to quantify epistemic uncertainty easily.
This method consists of mixing the predictions of an ensemble of DNNs trained to classify One class vs All the other classes (OVA) with the predictions of a standard DNN trained to perform All vs All (AVA) classification.
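A minimal NumPy sketch of this mixing follows, under the assumptions that the OVA head outputs one per-class score in [0, 1] and that the product is renormalized; the paper's exact combination rule may differ.

```python
import numpy as np

def ovnni_scores(ava_probs, ova_probs):
    """Scale the all-vs-all softmax probabilities by the matching one-vs-all
    scores, so a class is confident only when both heads agree."""
    mixed = ava_probs * ova_probs               # elementwise, shape (N, C)
    return mixed / mixed.sum(axis=1, keepdims=True)
```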
arXiv Detail & Related papers (2020-06-01T14:06:12Z)