Knowledge Distillation of Uncertainty using Deep Latent Factor Model
- URL: http://arxiv.org/abs/2510.19290v2
- Date: Fri, 24 Oct 2025 01:47:27 GMT
- Title: Knowledge Distillation of Uncertainty using Deep Latent Factor Model
- Authors: Sehyun Park, Jongjin Lee, Yunseop Shin, Ilsang Ohn, Yongdai Kim
- Abstract summary: We introduce a new method of distribution distillation called Gaussian distillation. It estimates the distribution of a teacher ensemble through a special Gaussian process called the deep latent factor model (DLF). Using multiple benchmark datasets, we demonstrate that the proposed Gaussian distillation outperforms existing baselines.
- Score: 10.148306002388196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep ensembles deliver state-of-the-art, reliable uncertainty quantification, but their heavy computational and memory requirements hinder their practical deployment in real applications such as on-device AI. Knowledge distillation compresses an ensemble into small student models, but existing techniques struggle to preserve uncertainty, in part because shrinking a DNN typically reduces its predictive variation. To resolve this limitation, we introduce a new method of distribution distillation (i.e., compressing a teacher ensemble into a student distribution instead of a student ensemble) called Gaussian distillation, which estimates the distribution of a teacher ensemble through a special Gaussian process called the deep latent factor model (DLF) by treating each member of the teacher ensemble as a realization of a certain stochastic process. The mean and covariance functions of the DLF model are estimated stably using the expectation-maximization (EM) algorithm. On multiple benchmark datasets, we demonstrate that the proposed Gaussian distillation outperforms existing baselines. In addition, we illustrate that Gaussian distillation works well for fine-tuning language models and for distribution-shift problems.
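As a rough illustration of the latent-factor idea only (the paper's actual DLF architecture and EM updates are not given in this abstract), the sketch below fits a probabilistic-PCA-style latent factor model to a matrix of teacher-ensemble outputs with a standard EM loop and then draws "virtual ensemble members" from the fitted Gaussian. The function names, the probe-batch setup, and the plain low-rank-plus-noise covariance are illustrative assumptions.

```python
import numpy as np

def fit_gaussian_via_em(Y, n_factors=4, n_iter=200, seed=0):
    """Fit a low-rank Gaussian (probabilistic-PCA-style latent factor model)
    to the rows of Y with a standard EM loop.

    Y : (n_members, D) array, e.g. each row holds one teacher-ensemble
        member's logits on a fixed probe batch, viewed as one realization
        of the underlying stochastic process.
    Returns the mean mu (D,), loadings W (D, n_factors), and noise variance
    sigma2, so the fitted covariance is W @ W.T + sigma2 * I.
    """
    rng = np.random.default_rng(seed)
    n_members, D = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu                                   # centered realizations
    W = 0.01 * rng.standard_normal((D, n_factors))
    sigma2 = Yc.var()

    for _ in range(n_iter):
        # E-step: posterior moments of the latent factors (standard PPCA EM)
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(n_factors))
        Ez = Yc @ W @ Minv                        # (n_members, n_factors)
        Ezz = n_members * sigma2 * Minv + Ez.T @ Ez

        # M-step: update loadings and isotropic noise variance
        W = (Yc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Yc ** 2)
                  - 2.0 * np.sum((Yc @ W) * Ez)
                  + np.trace(Ezz @ (W.T @ W))) / (n_members * D)

    return mu, W, sigma2


def sample_virtual_members(mu, W, sigma2, n_samples, seed=1):
    """Draw 'virtual ensemble members' from the fitted Gaussian, so the
    distilled distribution stands in for storing the whole ensemble."""
    rng = np.random.default_rng(seed)
    D, K = W.shape
    z = rng.standard_normal((n_samples, K))
    eps = rng.standard_normal((n_samples, D)) * np.sqrt(sigma2)
    return mu + z @ W.T + eps


# Toy usage: 20 teacher members, each a 50-dimensional logit vector.
Y = np.random.default_rng(2).standard_normal((20, 50))
mu, W, sigma2 = fit_gaussian_via_em(Y, n_factors=3)
virtual = sample_virtual_members(mu, W, sigma2, n_samples=100)
```

Storing only the fitted mean, loadings, and noise variance is the point of distribution distillation: uncertainty can be reproduced at inference time without keeping every ensemble member.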
Related papers
- Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation [50.19746127327559]
We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models.
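The exact form of the tail-aware divergence is not given in this summary; purely to illustrate splitting a KL-style distillation loss into a top-K term and an aggregated-tail term, a sketch follows. The value of k, the tail weight, and the bucketing of all remaining tokens into a single tail mass are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def topk_tail_kl(teacher_logits, student_logits, k=32, tail_weight=0.5):
    """Illustrative split of a KL-style distillation loss into a top-K term
    and an aggregated-tail term (hypothetical weighting, not the paper's).

    teacher_logits, student_logits: (batch, vocab) tensors.
    """
    p = F.softmax(teacher_logits, dim=-1)          # teacher distribution
    log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs

    # teacher's top-K tokens and the student's log-probs on those tokens
    topk_p, idx = p.topk(k, dim=-1)
    topk_log_q = log_q.gather(-1, idx)
    topk_log_p = torch.log(topk_p + 1e-12)

    # KL contribution from the top-K tokens
    kl_top = (topk_p * (topk_log_p - topk_log_q)).sum(dim=-1)

    # collapse everything outside the top-K into a single "tail" bucket
    tail_p = (1.0 - topk_p.sum(dim=-1)).clamp_min(1e-12)
    tail_q = (1.0 - topk_log_q.exp().sum(dim=-1)).clamp_min(1e-12)
    kl_tail = tail_p * (tail_p.log() - tail_q.log())

    return (kl_top + tail_weight * kl_tail).mean()
```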
arXiv Detail & Related papers (2026-02-24T11:54:06Z)
- Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield [54.328202401611264]
Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. We show that the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains.
arXiv Detail & Related papers (2025-11-27T18:24:28Z)
- Information Theoretic Learning for Diffusion Models with Warm Start [8.455757095201314]
We derive a tighter likelihood bound for noise-driven models to improve the accuracy and efficiency of maximum likelihood learning. Our key insight extends the classical relationship between the KL divergence and Fisher information to arbitrary noise perturbations. Treating the diffusion process as a Gaussian channel, we show that the proposed objective upper bounds the negative log-likelihood (NLL).
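The paper's bound itself is not reproduced here; for background, the classical identity being extended can be stated (for sufficiently regular densities) as follows. If $p_t = p * \mathcal{N}(0, tI)$ and $q_t = q * \mathcal{N}(0, tI)$ are the densities after adding Gaussian noise of variance $t$, a de Bruijn-type identity gives
$$\frac{\mathrm{d}}{\mathrm{d}t}\, D_{\mathrm{KL}}(p_t \,\|\, q_t) \;=\; -\tfrac{1}{2}\, \mathbb{E}_{x \sim p_t}\!\big[\, \|\nabla_x \log p_t(x) - \nabla_x \log q_t(x)\|^2 \,\big],$$
i.e. the KL divergence between the smoothed distributions decays at a rate given by their relative Fisher information. Integrating such a relation over the noise schedule is what connects score matching to likelihood bounds in the Gaussian-channel view.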
arXiv Detail & Related papers (2025-10-23T18:00:59Z)
- Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation [53.30082523545212]
Knowledge distillation (KD) is a core component in the training and deployment of modern generative models. We show that KD induces a trade-off between precision and recall in the student model. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
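The paper's own mechanism is not spelled out in this summary; as general background only, a precision/recall trade-off of this kind is often explained through the asymmetry of the KL divergence: fitting a student $q$ to a teacher $p$ with the forward direction $D_{\mathrm{KL}}(p\,\|\,q)=\mathbb{E}_{x\sim p}[\log p(x)-\log q(x)]$ is mass-covering, penalizing teacher regions the student misses (recall), whereas the reverse direction $D_{\mathrm{KL}}(q\,\|\,p)$ penalizes student samples that land where the teacher has little mass (precision).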
arXiv Detail & Related papers (2025-05-19T13:39:47Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that the student distills only from task-relevant representations. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- Reward-Directed Score-Based Diffusion Models via q-Learning [8.725446812770791]
We propose a new reinforcement learning (RL) formulation for training continuous-time score-based diffusion models for generative AI. Our formulation does not involve any pretrained model for the unknown score functions of the noise-perturbed data distributions. We show the effectiveness of our approach by comparing its performance with two state-of-the-art RL methods.
arXiv Detail & Related papers (2024-09-07T13:55:45Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution-adaptive clipping Kullback-Leibler (KL) loss as the distillation objective function (a rough clipped token-level KL is sketched after this summary).
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
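The precise "distribution-adaptive clipping" is not specified in the summary above; as a rough sketch only, the snippet below clamps the per-token teacher/student log-ratio to a fixed range before forming a KL-style token-level loss. The clip value and its non-adaptive form are placeholders, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def clipped_token_kl(teacher_logits, student_logits, clip=5.0):
    """Illustrative clipped token-level KL (not the paper's exact objective):
    the per-token log-ratio log p - log q is clamped before averaging,
    bounding the influence of tokens where the two models disagree wildly.

    teacher_logits, student_logits: (batch, seq_len, vocab) tensors.
    """
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    p = log_p.exp()

    # clamp the log-ratio; 'clip' is a hypothetical hyperparameter standing in
    # for whatever adaptive range the paper derives from the distributions
    ratio = (log_p - log_q).clamp(min=-clip, max=clip)

    kl_per_token = (p * ratio).sum(dim=-1)      # (batch, seq_len)
    return kl_per_token.mean()
```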
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach [49.97755400231656]
We show that a new accelerated DDPM sampler achieves accelerated performance for three broad distribution classes not previously considered. Our results show an improved dependency on the data dimension $d$ among accelerated DDPM-type samplers.
arXiv Detail & Related papers (2024-02-21T16:11:47Z)
- Neural Operator Variational Inference based on Regularized Stein Discrepancy for Deep Gaussian Processes [22.256068524699472]
We introduce Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes. NOVI uses a neural generator to obtain a sampler and minimizes the Regularized Stein Discrepancy in L2 space between the generated distribution and the true posterior. We demonstrate that the bias introduced by our method can be controlled by multiplying the divergence with a constant, which leads to robust error control and ensures the stability and precision of the algorithm.
arXiv Detail & Related papers (2023-09-22T06:56:35Z)
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
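As a sketch of the on-policy idea only (GKD's exact objective, sampling mixture, and hyperparameters are not given in this summary): the student samples its own continuations, and the loss matches teacher and student token distributions on those self-generated sequences. The Hugging Face-style model interface, the plain forward-KL objective, and the function name are illustrative assumptions; GKD itself uses a generalized Jensen-Shannon divergence.

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_step(student, teacher, prompts, optimizer, max_new_tokens=64):
    """One illustrative on-policy distillation step (a sketch, not GKD's exact
    recipe): sample sequences from the student, then match the teacher's
    token distributions on those self-generated tokens.

    'student' and 'teacher' are assumed to be Hugging Face-style causal LMs
    exposing .generate() and returning .logits from a forward pass.
    """
    student.eval()
    with torch.no_grad():
        # the student generates its own continuations (on-policy samples)
        sequences = student.generate(prompts, max_new_tokens=max_new_tokens, do_sample=True)
        teacher_logits = teacher(sequences).logits

    student.train()
    student_logits = student(sequences).logits

    # forward KL from teacher to student on the student's own samples;
    # GKD replaces this with a generalized Jensen-Shannon divergence
    log_q = F.log_softmax(student_logits, dim=-1)
    p = F.softmax(teacher_logits, dim=-1)
    loss = F.kl_div(log_q, p, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```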
arXiv Detail & Related papers (2023-06-23T17:56:26Z)
- Training Discrete Deep Generative Models via Gapped Straight-Through Estimator [72.71398034617607]
We propose a Gapped Straight-Through (GST) estimator to reduce the variance without incurring resampling overhead.
This estimator is inspired by the essential properties of Straight-Through Gumbel-Softmax.
Experiments demonstrate that the proposed GST estimator enjoys better performance compared to strong baselines on two discrete deep generative modeling tasks.
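GST's gap construction is not reproduced here; for context, the baseline it is inspired by, the Straight-Through Gumbel-Softmax estimator, can be sketched as follows: the forward pass emits a hard one-hot sample while gradients flow through the soft relaxation (PyTorch also ships this as F.gumbel_softmax(logits, tau, hard=True)).

```python
import torch
import torch.nn.functional as F

def straight_through_gumbel_softmax(logits, tau=1.0):
    """Baseline Straight-Through Gumbel-Softmax estimator (the estimator GST
    builds on, not GST itself): hard one-hot in the forward pass, soft
    relaxation in the backward pass.

    logits: (..., num_categories) unnormalized log-probabilities.
    """
    # sample Gumbel(0, 1) noise and form the soft relaxation
    u = torch.rand_like(logits).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(u))
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)

    # hard one-hot of the argmax category
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)

    # straight-through trick: forward value = y_hard, gradient = d(y_soft)
    return y_hard + (y_soft - y_soft.detach())
```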
arXiv Detail & Related papers (2022-06-15T01:46:05Z)
- Learning Generative Models using Denoising Density Estimators [29.068491722778827]
We introduce a new generative model based on denoising density estimators (DDEs).
Our main contribution is a novel technique to obtain generative models by minimizing the KL-divergence directly.
Experimental results demonstrate substantial improvement in density estimation and competitive performance in generative model training.
arXiv Detail & Related papers (2020-01-08T20:30:40Z)