Understanding the Distillation Process from Deep Generative Models to
Tractable Probabilistic Circuits
- URL: http://arxiv.org/abs/2302.08086v1
- Date: Thu, 16 Feb 2023 04:52:46 GMT
- Title: Understanding the Distillation Process from Deep Generative Models to
Tractable Probabilistic Circuits
- Authors: Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang
- Abstract summary: We theoretically and empirically discover that the performance of a PC can exceed that of its teacher model.
In particular, on ImageNet32, PCs achieve 4.06 bits-per-dimension, which is only 0.34 behind variational diffusion models.
- Score: 30.663322946413285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Probabilistic Circuits (PCs) are a general and unified computational
framework for tractable probabilistic models that support efficient computation
of various inference tasks (e.g., computing marginal probabilities). Towards
enabling such reasoning capabilities in complex real-world tasks, Liu et al.
(2022) propose to distill knowledge (through latent variable assignments) from
less tractable but more expressive deep generative models. However, it is still
unclear what factors make this distillation work well. In this paper, we
theoretically and empirically discover that the performance of a PC can exceed
that of its teacher model. Therefore, instead of performing distillation from
the most expressive deep generative model, we study what properties the teacher
model and the PC should have in order to achieve good distillation performance.
This leads to a generic algorithmic improvement as well as other
data-type-specific ones over the existing latent variable distillation
pipeline. Empirically, we outperform SoTA TPMs by a large margin on challenging
image modeling benchmarks. In particular, on ImageNet32, PCs achieve 4.06
bits-per-dimension, which is only 0.34 behind variational diffusion models
(Kingma et al., 2021).
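To make the tractability claim above concrete, here is a minimal, hypothetical sketch (not code from the paper): a toy probabilistic circuit over two binary variables answers a marginal query with the same single bottom-up pass used for the joint, simply by setting the leaves of marginalized variables to 1. The bits_per_dim helper shows how the reported 4.06 bits-per-dimension figure relates to average negative log-likelihood; all names and parameters here are illustrative only.

```python
import math

# Toy probabilistic circuit over binary variables X1, X2: one sum (mixture)
# node over two product nodes, whose children are Bernoulli leaves.
# Marginalizing a variable only requires setting its leaf value to 1,
# so joint and marginal queries cost the same single bottom-up pass.

def bernoulli_leaf(p, value):
    """Leaf distribution; value=None marks the variable as marginalized out."""
    if value is None:
        return 1.0                      # a Bernoulli sums to 1 over {0, 1}
    return p if value == 1 else 1.0 - p

def pc_forward(x1, x2):
    """Evaluate the circuit bottom-up under (possibly partial) evidence."""
    prod1 = bernoulli_leaf(0.9, x1) * bernoulli_leaf(0.2, x2)  # product node 1
    prod2 = bernoulli_leaf(0.1, x1) * bernoulli_leaf(0.7, x2)  # product node 2
    return 0.6 * prod1 + 0.4 * prod2                           # sum node

print(pc_forward(1, 0))     # joint P(X1=1, X2=0)  = 0.444
print(pc_forward(1, None))  # marginal P(X1=1)     = 0.580, same pass

def bits_per_dim(avg_nll_nats, num_dims):
    """Convert average negative log-likelihood (in nats per example) to
    bits-per-dimension; e.g. ImageNet32 has 32 * 32 * 3 = 3072 dimensions."""
    return avg_nll_nats / (num_dims * math.log(2))
```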
Related papers
- One-Step Diffusion Distillation through Score Implicit Matching [74.91234358410281]
We present Score Implicit Matching (SIM), a new approach to distilling pre-trained diffusion models into single-step generator models.
SIM shows strong empirical performance for one-step generators.
By applying SIM to a leading transformer-based diffusion model, we distill a single-step generator for text-to-image generation.
arXiv Detail & Related papers (2024-10-22T08:17:20Z)
- One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a 5x larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has recently been proposed as a remedy for their slow sampling, reducing the number of inference steps to one or a few.
We present BOOT, a novel technique that addresses the limitations of existing distillation approaches with an efficient data-free algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Progressive Volume Distillation with Active Learning for Efficient NeRF Architecture Conversion [27.389511043400635]
Neural Radiance Fields (NeRF) have been widely adopted as practical and versatile representations for 3D scenes.
We propose Progressive Volume Distillation with Active Learning (PVD-AL), a systematic distillation method.
PVD-AL decomposes each structure into two parts and progressively performs distillation from shallower to deeper volume representation.
arXiv Detail & Related papers (2023-04-08T13:59:18Z)
- Scaling Up Probabilistic Circuits by Latent Variable Distillation [29.83240905570575]
As the number of parameters in PCs increases, their performance immediately plateaus.
We leverage less tractable but more expressive deep generative models to provide extra supervision over the latent variables of PCs (a toy sketch of this idea appears after the related-papers list).
In particular, on the image modeling benchmarks, PCs achieve competitive performance against some of the widely-used deep generative models.
arXiv Detail & Related papers (2022-10-10T02:07:32Z)
- Functional Ensemble Distillation [18.34081591772928]
We investigate how to best distill an ensemble's predictions using an efficient model.
We find that learning the distilled model via a simple mixup augmentation scheme significantly boosts performance.
arXiv Detail & Related papers (2022-06-05T14:07:17Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose CoFi (Coarse- and Fine-grained Pruning), a task-specific structured pruning method.
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and their divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from large, powerful off-the-shelf pre-trained models.
Our solution performs distillation by only driving prediction of the student model consistent with that of the teacher model.
We empirically find that such simple distillation settings are extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set can be significantly improved.
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Most existing methods assign a fixed, equal weight to every teacher model throughout distillation.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
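As referenced in the "Scaling Up Probabilistic Circuits by Latent Variable Distillation" entry above, the following is a highly simplified, hypothetical sketch of the latent variable distillation idea rather than the algorithm from either paper: a placeholder teacher assigns a hard latent code to each example, and every branch of a small mixture (standing in for one sum unit of a PC) is then fit in closed form on its own slice of the data. The teacher_latents quantizer is a stand-in; in the papers, the assignments come from an expressive deep generative model.

```python
import numpy as np

# Hypothetical sketch of latent variable distillation: a "teacher" provides
# hard latent assignments, which decompose mixture fitting into closed-form
# per-cluster estimates. Here the teacher is just a nearest-center quantizer;
# in the papers it is an expressive deep generative model.

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))               # toy continuous data

def teacher_latents(x, k=4):
    """Assign each example to one of k latent codes (placeholder teacher)."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                 # hard latent code per example

z = teacher_latents(data)

# Fit one Gaussian input distribution per latent value; mixture weights are
# the empirical frequencies of the assignments. With the latents supervised,
# no EM-style iteration over hidden variables is needed.
weights, means, stds = [], [], []
for code in np.unique(z):
    chunk = data[z == code]
    weights.append(len(chunk) / len(data))
    means.append(chunk.mean(axis=0))
    stds.append(chunk.std(axis=0) + 1e-3)

print("mixture weights:", np.round(weights, 3))
```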