A Sentence Speaks a Thousand Images: Domain Generalization through
Distilling CLIP with Language Guidance
- URL: http://arxiv.org/abs/2309.12530v1
- Date: Thu, 21 Sep 2023 23:06:19 GMT
- Title: A Sentence Speaks a Thousand Images: Domain Generalization through
Distilling CLIP with Language Guidance
- Authors: Zeyi Huang, Andy Zhou, Zijian Lin, Mu Cai, Haohan Wang, Yong Jae Lee
- Abstract summary: We propose a novel approach for domain generalization that leverages recent advances in large vision-language models.
The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations.
We evaluate our proposed method, dubbed RISE, on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods.
- Score: 41.793995960478355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Domain generalization studies the problem of training a model with samples
from several domains (or distributions) and then testing the model with samples
from a new, unseen domain. In this paper, we propose a novel approach for
domain generalization that leverages recent advances in large vision-language
models, specifically a CLIP teacher model, to train a smaller model that
generalizes to unseen domains. The key technical contribution is a new type of
regularization that requires the student's learned image representations to be
close to the teacher's learned text representations obtained from encoding the
corresponding text descriptions of images. We introduce two designs of the loss
function, absolute and relative distance, which provide specific guidance on
how the training process of the student model should be regularized. We
evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic
Embeddings), on various benchmark datasets and show that it outperforms several
state-of-the-art domain generalization methods. To our knowledge, our work is
the first to leverage knowledge distillation using a large vision-language
model for domain generalization. By incorporating text-based information, RISE
improves the generalization capability of machine learning models.
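The two regularizers described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are invented here, and the exact distance functions, normalization, and loss weighting in RISE may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so distances reduce to cosine geometry.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def absolute_distance_loss(student_img, teacher_txt):
    # Absolute distance: pull each student image embedding toward the
    # CLIP text embedding of its caption (mean squared L2 distance
    # after normalization).
    s = l2_normalize(student_img)
    t = l2_normalize(teacher_txt)
    return float(np.mean(np.sum((s - t) ** 2, axis=1)))

def relative_distance_loss(student_img, teacher_txt):
    # Relative distance: match the pairwise similarity structure of the
    # batch instead of the embeddings themselves, which tolerates the
    # student and teacher living in different embedding spaces.
    s = l2_normalize(student_img)
    t = l2_normalize(teacher_txt)
    sim_s = s @ s.T
    sim_t = t @ t.T
    return float(np.mean((sim_s - sim_t) ** 2))

rng = np.random.default_rng(0)
imgs = rng.normal(size=(4, 64))   # student image embeddings (toy)
txts = rng.normal(size=(4, 64))   # CLIP text embeddings of captions (toy)

abs_loss = absolute_distance_loss(imgs, txts)
rel_loss = relative_distance_loss(imgs, txts)
```

Either loss would be added to the usual task loss during student training; both vanish when the student's geometry matches the teacher's text geometry.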
Related papers
- VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models [18.259733507395634]
We introduce a new metric called Visual Language Evaluation Understudy (VLEU)
VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model.
Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models.
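The VLEU summary above reduces to a KL divergence between two distributions. A toy sketch, assuming both distributions have been discretized into probability vectors (e.g., CLIP-scored class probabilities); the names and the discretization are illustrative, not the paper's exact procedure.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) for discrete distributions; eps guards against log(0).
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy example: a marginal distribution over a small visual-text
# vocabulary vs. the distribution induced by the images a T2I model
# actually generates for those prompts.
marginal_text = [0.25, 0.25, 0.25, 0.25]   # diverse prompts
generated_imgs = [0.70, 0.10, 0.10, 0.10]  # model collapses to one mode

vleu_like = kl_divergence(marginal_text, generated_imgs)
```

A larger divergence indicates the generated images cover the prompt space poorly, i.e., weaker generalization.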
arXiv Detail & Related papers (2024-09-23T04:50:36Z)
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study the multi-source Domain Generalization of text classification.
We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representations.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- VLLaVO: Mitigating Visual Gap through LLMs [7.352822795984628]
Cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data.
We propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners.
arXiv Detail & Related papers (2024-01-06T16:33:39Z)
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
- Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z)
- Grounding Visual Representations with Texts for Domain Generalization [9.554646174100123]
Cross-modality supervision can be successfully used to ground domain-invariant visual representations.
Our proposed method achieves state-of-the-art results and ranks 1st in average performance for five multi-domain datasets.
arXiv Detail & Related papers (2022-07-21T03:43:38Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
- Generalizable Model-agnostic Semantic Segmentation via Target-specific Normalization [24.14272032117714]
We propose a novel domain generalization framework for the generalizable semantic segmentation task.
We exploit the model-agnostic learning to simulate the domain shift problem.
Considering the data-distribution discrepancy between seen source and unseen target domains, we develop the target-specific normalization scheme.
arXiv Detail & Related papers (2020-03-27T09:25:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.