Compressing Visual-linguistic Model via Knowledge Distillation
- URL: http://arxiv.org/abs/2104.02096v1
- Date: Mon, 5 Apr 2021 18:02:17 GMT
- Title: Compressing Visual-linguistic Model via Knowledge Distillation
- Authors: Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang,
Zicheng Liu
- Abstract summary: We study knowledge distillation to compress a transformer-based large visual-linguistic model into a small model.
We show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks.
- Score: 43.73998154661652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite exciting progress in pre-training for visual-linguistic (VL)
representations, very few aspire to a small VL model. In this paper, we study
knowledge distillation (KD) to effectively compress a transformer-based large
VL model into a small VL model. The major challenge arises from the
inconsistent regional visual tokens extracted from different detectors of
Teacher and Student, resulting in the misalignment of hidden representations
and attention distributions. To address the problem, we retrain and adapt the
Teacher by using the same region proposals from Student's detector while the
features are from Teacher's own object detector. With aligned network inputs,
the adapted Teacher is capable of transferring the knowledge through the
intermediate representations. Specifically, we use the mean square error loss
to mimic the attention distribution inside the transformer block and present a
token-wise noise contrastive loss to align the hidden state by contrasting with
negative representations stored in a sample queue. To this end, we show that
our proposed distillation significantly improves the performance of small VL
models on image captioning and visual question answering tasks. It reaches
120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its
non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from
the baseline. Our extensive experiments and ablations confirm the effectiveness
of VL distillation in both pre-training and fine-tuning stages.
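The two intermediate-representation losses described in the abstract (an MSE loss on attention distributions and a token-wise noise contrastive loss against a queue of negatives) can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and a temperature `tau`, not the authors' implementation; the function names and the queue handling are chosen for the example.

```python
import numpy as np

def attention_mse_loss(attn_teacher, attn_student):
    """MSE between teacher and student attention distributions.
    Both arrays: (num_heads, seq_len, seq_len); assumes head counts
    have already been matched between the two models."""
    return float(np.mean((attn_teacher - attn_student) ** 2))

def token_nce_loss(h_student, h_teacher, neg_queue, tau=0.07):
    """Token-wise noise contrastive loss: each student hidden state is
    pulled toward its teacher counterpart (the positive) and pushed away
    from negative representations stored in a queue of past teacher states.
    h_student, h_teacher: (seq_len, dim); neg_queue: (K, dim)."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    s, t, q = l2norm(h_student), l2norm(h_teacher), l2norm(neg_queue)
    pos = np.sum(s * t, axis=-1, keepdims=True) / tau   # (seq_len, 1)
    neg = (s @ q.T) / tau                               # (seq_len, K)
    logits = np.concatenate([pos, neg], axis=1)         # positive at index 0
    # cross-entropy with the positive as the correct class (log-sum-exp trick)
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - logits[:, 0]))
```

In practice the queue would be updated each step with the newest teacher states, FIFO-style, so negatives stay cheap to gather across batches.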
Related papers
- OVD: On-policy Verbal Distillation [47.727229201069555]
On-policy Verbal Distillation (OVD) is a memory-efficient framework that replaces token-level probability matching with trajectory matching.
OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback.
arXiv Detail & Related papers (2026-01-29T16:48:14Z) - Stable Mean Teacher for Semi-supervised Video Action Detection [3.5743998666556855]
We focus on semi-supervised learning for video action detection.
We present Stable Mean Teacher, a simple end-to-end teacher-based framework that benefits from improved and temporally consistent pseudo labels.
arXiv Detail & Related papers (2024-12-10T00:25:33Z) - Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations.
We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders.
The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
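The REPA regularizer summarized above can be illustrated with a short sketch: project the denoiser's noisy hidden states into the pretrained encoder's feature space and maximize cosine similarity with the clean-image features. This is a simplified NumPy rendering under assumed shapes (the projection is a plain array standing in for a learnable head), not the paper's implementation.

```python
import numpy as np

def repa_loss(denoiser_hidden, encoder_feats, proj):
    """REPA-style alignment (sketch): negative mean cosine similarity
    between projected denoiser hidden states and pretrained-encoder
    features of the clean image, averaged over patch tokens.
    denoiser_hidden: (tokens, d_model); encoder_feats: (tokens, d_enc);
    proj: (d_model, d_enc) projection (learnable in the real setup)."""
    z = denoiser_hidden @ proj
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    y = encoder_feats / np.linalg.norm(encoder_feats, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(z * y, axis=-1)))  # in [-1, 1]; lower is better
```

This term would be added, with a weighting coefficient, to the usual denoising objective.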
arXiv Detail & Related papers (2024-10-09T14:34:53Z) - Anomaly Detection by Adapting a pre-trained Vision Language Model [48.225404732089515]
We present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model.
We introduce the learnable prompt and propose to associate it with abnormal patterns through self-supervised learning.
We achieve the state-of-the-art 97.5/55.6 and 89.3/33.1 on MVTec-AD and VisA for anomaly detection and localization.
arXiv Detail & Related papers (2024-03-14T15:35:07Z) - Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation [12.177329445930276]
We propose a novel CNN-to-ViT KD framework, dubbed C2VKD.
We first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations.
We then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes.
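The decoupled supervision idea behind the PDD module can be sketched for a single pixel: split the distillation KL into a target-class term (target vs. everything else) and a non-target term over the renormalized remaining classes. This is an illustrative NumPy sketch of decoupled distillation in general, with assumed weights `alpha`/`beta`, not the C2VKD code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def decoupled_kd(student_logits, teacher_logits, target, alpha=1.0, beta=1.0):
    """Decoupled distillation for one pixel (sketch): a binary term on the
    target class plus a term over the renormalized non-target classes."""
    ps, pt = softmax(student_logits), softmax(teacher_logits)
    # target-class term: (target prob, remaining mass) as a binary distribution
    bs = np.array([ps[target], 1.0 - ps[target]])
    bt = np.array([pt[target], 1.0 - pt[target]])
    tckd = kl(bt, bs)
    # non-target term: distribution over classes != target, renormalized
    mask = np.arange(len(ps)) != target
    ns = ps[mask] / ps[mask].sum()
    nt = pt[mask] / pt[mask].sum()
    nckd = kl(nt, ns)
    return alpha * tckd + beta * nckd
```

For dense prediction this would be evaluated per pixel and averaged, with the target taken from the label map.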
arXiv Detail & Related papers (2023-10-11T07:45:37Z) - Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z) - CAVL: Learning Contrastive and Adaptive Representations of Vision and Language [10.57079240576682]
Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to take lots of computation resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
arXiv Detail & Related papers (2023-04-10T05:54:03Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Anomaly Detection via Reverse Distillation from One-Class Embedding [2.715884199292287]
We propose a novel T-S model consisting of a teacher encoder and a student decoder.
Instead of receiving raw images directly, the student network takes teacher model's one-class embedding as input.
In addition, we introduce a trainable one-class bottleneck embedding module in our T-S model.
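Anomaly scoring in this reverse-distillation setup rests on a simple idea: a student trained only on normal data reconstructs teacher features well on normal regions and poorly on anomalies. A minimal NumPy sketch of such a per-location score, assuming spatial feature maps of shape (H, W, C), is:

```python
import numpy as np

def anomaly_map(teacher_feats, student_feats):
    """Per-location anomaly score (sketch): 1 - cosine similarity between
    teacher encoder features and the student decoder's reconstruction.
    Near 0 where reconstruction succeeds (normal regions); larger on
    anomalies. Both arrays: (H, W, C); returns (H, W) in [0, 2]."""
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    return 1.0 - np.sum(t * s, axis=-1)
```

In the paper's setting the score would be aggregated over multiple feature scales; the single-scale version here is only for illustration.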
arXiv Detail & Related papers (2022-01-26T01:48:37Z) - An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors are at the expense of high computational costs and are hard to deploy to low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.