Leveraging Vision-Language Models for Improving Domain Generalization in
Image Classification
- URL: http://arxiv.org/abs/2310.08255v2
- Date: Sat, 9 Mar 2024 09:53:26 GMT
- Title: Leveraging Vision-Language Models for Improving Domain Generalization in
Image Classification
- Authors: Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, R.
Venkatesh Babu
- Abstract summary: Vision-Language Models (VLMs) are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions.
We propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model.
This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings.
- Score: 35.277880733198586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) such as CLIP are trained on large amounts of
image-text pairs, resulting in remarkable generalization across several data
distributions. However, in several cases, their expensive training and data
collection/curation costs do not justify the end application. This motivates a
vendor-client paradigm, where a vendor trains a large-scale VLM and grants only
input-output access to clients on a pay-per-query basis in a black-box setting.
The client aims to minimize inference cost by distilling the VLM to a student
model using the limited available task-specific data, and further deploying
this student model in the downstream application. While naive distillation
largely improves the In-Domain (ID) accuracy of the student, it fails to
transfer the superior out-of-distribution (OOD) generalization of the VLM
teacher using the limited available labeled images. To mitigate this, we
propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which
first aligns the vision and language modalities of the teacher model with the
vision modality of a pre-trained student model, and further distills the
aligned VLM representations to the student. This maximally retains the
pre-trained features of the student, while also incorporating the rich
representations of the VLM image encoder and the superior generalization of the
text embeddings. The proposed approach achieves state-of-the-art results on the
standard Domain Generalization benchmarks in a black-box teacher setting as
well as a white-box setting where the weights of the VLM are accessible.
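The abstract names three stages (align, distill, predict) but gives no loss function, so the following is only a minimal sketch of one plausible reading. It assumes a cosine-based objective that pulls a projected student feature toward both the teacher's image embedding and its text embedding, and a prediction step that picks the class whose text embedding is nearest. The names `project`, `align_distill_loss`, `predict`, and the projection matrix `W` are hypothetical and do not come from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def project(W, feat):
    """Hypothetical linear head mapping student features into the teacher's embedding space."""
    return [sum(w * x for w, x in zip(row, feat)) for row in W]


def align_distill_loss(student_feat, W, teacher_img_emb, teacher_txt_emb):
    """Assumed objective: 2 - cos(z, image emb) - cos(z, text emb).

    It is 0 when the projected student feature z points in the same direction
    as both teacher embeddings, and grows as z drifts away from either one.
    """
    z = project(W, student_feat)
    return (1.0 - cosine(z, teacher_img_emb)) + (1.0 - cosine(z, teacher_txt_emb))


def predict(student_feat, W, class_txt_embs):
    """Return the index of the class whose text embedding is most similar to z."""
    z = project(W, student_feat)
    scores = [cosine(z, t) for t in class_txt_embs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for illustration only
    feat = [1.0, 0.0]               # toy student feature
    print(align_distill_loss(feat, W, [1.0, 0.0], [2.0, 0.0]))
    print(predict(feat, W, [[0.0, 1.0], [1.0, 0.0]]))
```

In a black-box setting the teacher embeddings would be obtained by querying the vendor's model. Under this reading, only the projection head would be trained during alignment and the full student during distillation, but that split is an interpretation of the abstract rather than a confirmed detail.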
Related papers
- How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
arXiv Detail & Related papers (2024-08-07T17:59:40Z)
- Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners [8.707819647492467]
We explore capturing task-specific information via meticulous refinement of entire Vision-Language Models (VLMs).
To mitigate these issues, we propose a framework named CLIP-CITE via designing a discriminative visual-text task.
arXiv Detail & Related papers (2024-07-04T15:22:54Z)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs [83.24033574914425]
We present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving.
Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information.
Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.
arXiv Detail & Related papers (2024-06-20T17:54:03Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Why are Visually-Grounded Language Models Bad at Image Classification? [39.76294811955341]
We revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA.
We find that existing proprietary and public VLMs significantly underperform CLIP on standard image classification benchmarks like ImageNet.
Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data.
arXiv Detail & Related papers (2024-05-28T17:57:06Z)
- Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z)
- Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models [55.5610165938949]
Fine-tuning vision-language models (VLMs) has gained increasing popularity due to its practical value.
This paper explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model.
We introduce three customized ensemble strategies, each tailored to one specific scenario.
The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-28T05:17:25Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)