VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
- URL: http://arxiv.org/abs/2408.16930v1
- Date: Thu, 29 Aug 2024 22:13:29 GMT
- Title: VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
- Authors: Zaiwei Zhang, Gregory P. Meyer, Zhichao Lu, Ashish Shrivastava, Avinash Ravichandran, Eric M. Wolff
- Abstract summary: We introduce an effective method to distill knowledge from an off-the-shelf vision-language model (VLM).
We develop a framework that generates novel text supervision and distills free-form text into a vision encoder.
To our knowledge, this work is the first to utilize text supervision generated by an off-the-shelf VLM and apply it to vanilla randomly initialized vision encoders.
- Score: 25.927771583678272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For visual recognition, knowledge distillation typically involves transferring knowledge from a large, well-trained teacher model to a smaller student model. In this paper, we introduce an effective method to distill knowledge from an off-the-shelf vision-language model (VLM), demonstrating that it provides novel supervision in addition to those from a conventional vision-only teacher model. Our key technical contribution is the development of a framework that generates novel text supervision and distills free-form text into a vision encoder. We showcase the effectiveness of our approach, termed VLM-KD, across various benchmark datasets, showing that it surpasses several state-of-the-art long-tail visual classifiers. To our knowledge, this work is the first to utilize knowledge distillation with text supervision generated by an off-the-shelf VLM and apply it to vanilla randomly initialized vision encoders.
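The abstract describes distilling free-form text produced by an off-the-shelf VLM into a randomly initialized vision encoder. Below is a minimal sketch of how such a distillation objective could be wired up; the frozen CLIP-style text encoder, the projection head, the cosine-distance distillation term, the loss weighting, and all module names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: one plausible way to distill VLM-generated text supervision into a
# randomly initialized vision encoder. Caption source, text encoder, and loss form
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLMTextDistiller(nn.Module):
    def __init__(self, vision_encoder, text_encoder, feat_dim, text_dim, num_classes):
        super().__init__()
        self.vision_encoder = vision_encoder          # trained from scratch
        self.text_encoder = text_encoder              # frozen, e.g. a CLIP-style text tower
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(feat_dim, text_dim)     # map image features into text space
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, images, caption_tokens, labels, distill_weight=1.0):
        feats = self.vision_encoder(images)                    # (B, feat_dim)
        logits = self.classifier(feats)
        cls_loss = F.cross_entropy(logits, labels)

        with torch.no_grad():
            # caption_tokens come from captions generated by an off-the-shelf VLM
            text_emb = self.text_encoder(caption_tokens)       # (B, text_dim)
        img_emb = F.normalize(self.proj(feats), dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        distill_loss = 1.0 - (img_emb * text_emb).sum(dim=-1).mean()  # cosine distance

        return cls_loss + distill_weight * distill_loss, logits
```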
Related papers
- ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts [29.446235941754345]
Vision-language (VL) learning requires extensive visual perception capabilities.
Recent works typically rely on training huge models on massive datasets to develop these capabilities.
This paper proposes a new framework that transfers the knowledge from a hub of Vision Experts.
arXiv Detail & Related papers (2025-04-01T12:02:40Z)
- Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models [53.13731845500678]
We introduce a novel metric, $Rank_e$, to quantify the effect of vision encoder's prior knowledge on MLLM performance.
We propose VisPRE, a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level.
Experimental results demonstrate that augmenting vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs.
arXiv Detail & Related papers (2025-03-23T11:33:09Z)
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders [28.22099619211775]
Visual encoders are fundamental components in vision-language models (VLMs).
Recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost.
We present a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model.
arXiv Detail & Related papers (2025-01-03T09:10:34Z)
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models [40.858721356497085]
We introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model.
Our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels.
In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits.
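A hedged sketch of the shared-class-vector idea summarized above: class vectors produced once by the teacher's text encoder are reused by both teacher and student image encoders to form logits, and the student matches the teacher's soft predictions. The temperature, logit scale, and KL objective are assumptions, not PromptKD's exact recipe.

```python
import torch
import torch.nn.functional as F

def clip_style_logits(image_features, class_vectors, scale=100.0):
    # Cosine-similarity logits between image features and stored class vectors.
    img = F.normalize(image_features, dim=-1)
    cls = F.normalize(class_vectors, dim=-1)
    return scale * img @ cls.t()

def distill_step(student_feats, teacher_feats, class_vectors, T=4.0):
    s_logits = clip_style_logits(student_feats, class_vectors)
    with torch.no_grad():
        t_logits = clip_style_logits(teacher_feats, class_vectors)
    return F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
```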
arXiv Detail & Related papers (2024-03-05T08:53:30Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
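A hedged sketch of a batch-level contrastive alignment between document-object features (from an auxiliary encoder) and vision-encoder features, in the spirit of the DoCo summary above; the symmetric InfoNCE form and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_align(object_feats, visual_feats, temperature=0.07):
    # Align auxiliary document-object features with vision-encoder features.
    a = F.normalize(object_feats, dim=-1)           # (B, D)
    b = F.normalize(visual_feats, dim=-1)           # (B, D)
    logits = a @ b.t() / temperature                # (B, B), positives on the diagonal
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```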
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- Generative Model-based Feature Knowledge Distillation for Action Recognition [11.31068233536815]
Our paper introduces an innovative knowledge distillation framework that uses a generative model to train a lightweight student model.
The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets.
arXiv Detail & Related papers (2023-12-14T03:55:29Z)
- Learning without Forgetting for Vision-Language Models [65.49600786387106]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z)
- Retrieval-based Knowledge Augmented Vision Language Pre-training [9.779887832992435]
A key challenge of knowledge-augmented pre-training is the lack of clear connections between knowledge and multi-modal data.
In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework.
For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data.
arXiv Detail & Related papers (2023-04-27T02:23:47Z)
- Vision-Language Models for Vision Tasks: A Survey [62.543250338410836]
Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs.
This paper provides a systematic review of visual language models for various visual recognition tasks.
arXiv Detail & Related papers (2023-04-03T02:17:05Z)
- Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks.
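A hedged sketch of the relation-graph idea summarized above: pairwise similarities among instances in a batch form a graph, and the student's graph is pulled toward the frozen self-supervised teacher's. The row-softmax normalization and MSE objective are assumptions about the exact formulation.

```python
import torch
import torch.nn.functional as F

def relation_graph(embeddings, temperature=0.1):
    # Instance-instance similarity graph in the feature embedding space.
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                   # (B, B) pairwise similarities
    return F.softmax(sim, dim=-1)                   # row-normalized graph edges

def graph_alignment_loss(student_emb, teacher_emb):
    with torch.no_grad():
        g_t = relation_graph(teacher_emb)           # teacher graph (no gradients)
    g_s = relation_graph(student_emb)
    return F.mse_loss(g_s, g_t)
```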
arXiv Detail & Related papers (2022-11-23T19:27:48Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
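A hedged sketch of text enrichment in the spirit of the K-LITE summary above: a class-name prompt is augmented with its WordNet gloss before being fed to a text encoder. The prompt template and "first synset" heuristic are assumptions, and the Wiktionary lookup mentioned in the summary is omitted here.

```python
# Requires: pip install nltk && python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def enrich_with_wordnet(class_name: str) -> str:
    # Append the WordNet definition of the class name, when one exists.
    prompt = f"a photo of a {class_name}"
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if synsets:
        prompt += f", which is {synsets[0].definition()}"
    return prompt

# Example: enrich_with_wordnet("vulture") appends the WordNet gloss of "vulture"
# to the plain prompt "a photo of a vulture".
```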
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Self-knowledge distillation (SKD) aims at transferring the knowledge from a large teacher model to a small student model.
In this paper, we propose a novel SKD method that departs from mainstream approaches.
Experiments and ablation studies show its great effectiveness and generalization on various kinds of tasks.
arXiv Detail & Related papers (2021-12-04T02:25:46Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)