VLLaVO: Mitigating Visual Gap through LLMs
- URL: http://arxiv.org/abs/2401.03253v2
- Date: Sat, 16 Mar 2024 16:39:07 GMT
- Title: VLLaVO: Mitigating Visual Gap through LLMs
- Authors: Shuhao Chen, Yulong Zhang, Weisen Jiang, Jiangang Lu, Yu Zhang
- Abstract summary: Cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data.
We propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners.
- Score: 7.352822795984628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances achieved by deep learning models rely on the independent and identically distributed assumption, hindering their applications in real-world scenarios with domain shifts. To tackle this issue, cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data. However, in visual cross-domain learning, traditional methods concentrate solely on the image modality, disregarding the potential benefits of incorporating the text modality. In this work, we propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners. VLLaVO uses vision-language models to convert images into detailed textual descriptions. A large language model is then finetuned on textual descriptions of the source/target domain generated by a designed instruction template. Extensive experimental results under domain generalization and unsupervised domain adaptation settings demonstrate the effectiveness of the proposed method.
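The abstract describes a two-stage pipeline: a vision-language model first turns each image into a detailed textual description, and a large language model is then finetuned on those descriptions wrapped in an instruction template, so that cross-domain classification happens entirely in text space. The sketch below illustrates that flow under stated assumptions; the checkpoints, template wording, and helper names (`describe`, `build_example`) are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of a VLLaVO-style two-stage pipeline (assumed components).
from transformers import pipeline

# Stage 1: convert an image into a detailed textual description with a VLM.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

def describe(image_path: str) -> str:
    """Return a textual description of the image at image_path."""
    return captioner(image_path)[0]["generated_text"]

# Stage 2: wrap each description in an instruction template; the resulting
# (prompt, completion) pairs are what the LLM would be finetuned on.
INSTRUCTION_TEMPLATE = (
    "Below is a description of an image.\n"
    "Description: {description}\n"
    "Question: Which of the following categories does the image belong to? {classes}\n"
    "Answer:"
)

def build_example(image_path: str, label: str, classes: list[str]) -> dict:
    """Build one supervised finetuning example from a labeled source image."""
    prompt = INSTRUCTION_TEMPLATE.format(
        description=describe(image_path), classes=", ".join(classes)
    )
    return {"prompt": prompt, "completion": " " + label}
```

Because both source and target images are mapped into free-form text before classification, much of the low-level visual domain shift is abstracted away, which is the intuition the abstract points to; at test time the same template is applied to target-domain descriptions and the finetuned LLM's answer is read off as the prediction.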
Related papers
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- Unified Language-driven Zero-shot Domain Adaptation [55.64088594551629]
Unified Language-driven Zero-shot Domain Adaptation (ULDA) is a novel task setting.
It enables a single model to adapt to diverse target domains without explicit domain-ID knowledge.
arXiv Detail & Related papers (2024-04-10T16:44:11Z)
- Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation [2.104191333263349]
Current visual grounding methods leverage pre-trained visual and language backbones independently to obtain visual features and linguistic features.
The resulting modality gap arises from the domain gap between the single-modal pre-training backbones used in current visual grounding methods.
We propose an Empowering Pre-trained Model for Visual Grounding framework, which distills a multimodal pre-trained model to guide the visual grounding task.
arXiv Detail & Related papers (2023-12-29T15:32:11Z)
- Domain Prompt Learning with Quaternion Networks [49.45309818782329]
We propose to leverage domain-specific knowledge from domain-specific foundation models to transfer the robust recognition ability of Vision-Language Models to specialized domains.
We present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features.
Our proposed method achieves new state-of-the-art results in prompt learning.
arXiv Detail & Related papers (2023-12-12T08:49:39Z)
- Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance on specific-domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z)
- A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance [41.793995960478355]
We propose a novel approach for domain generalization that leverages recent advances in large vision-language models.
The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations.
We evaluate our proposed method, dubbed RISE, on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods.
arXiv Detail & Related papers (2023-09-21T23:06:19Z)
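The RISE entry above describes its key contribution as a regularizer that pulls the student's image representations toward the teacher's text representations of the corresponding captions. Below is a minimal, generic sketch of such a text-guided distillation term, not the paper's implementation; the weighting factor `lam` and the assumption that both feature sets are already projected to a shared dimension are illustrative.

```python
# Generic text-guided distillation regularizer in the spirit of the RISE
# summary above (illustrative sketch, not the paper's code).
import torch
import torch.nn.functional as F

def distill_with_text_guidance(student_img_feats: torch.Tensor,
                               teacher_text_feats: torch.Tensor,
                               logits: torch.Tensor,
                               labels: torch.Tensor,
                               lam: float = 1.0) -> torch.Tensor:
    """Classification loss plus a term aligning student image features
    with frozen teacher (e.g. CLIP) text embeddings of each image's caption."""
    s = F.normalize(student_img_feats, dim=-1)   # (B, D), student backbone + projection
    t = F.normalize(teacher_text_feats, dim=-1)  # (B, D), frozen text encoder, no grad
    task_loss = F.cross_entropy(logits, labels)
    align_loss = (1.0 - (s * t).sum(dim=-1)).mean()  # mean (1 - cosine similarity)
    return task_loss + lam * align_loss
```

The intuition, as the entry suggests, is that caption embeddings are comparatively domain-invariant, so aligning image features with them nudges the student toward domain-general representations.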
- Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation [45.02052030837188]
We study open-vocabulary domain adaptation (OVDA), a new unsupervised domain adaptation framework.
We design a Prompt Ensemble Self-training (PEST) technique that exploits the synergy between vision and language.
PEST outperforms the state-of-the-art consistently across 10 image recognition tasks.
arXiv Detail & Related papers (2023-06-29T03:39:35Z)
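The PEST entry gives only the high-level recipe (an ensemble of prompts combined with self-training on unlabeled target data), so the sketch below is a generic illustration of prompt-ensemble pseudo-labelling with CLIP rather than the paper's actual algorithm; the checkpoint, the template list, and the confidence threshold are assumptions.

```python
# Generic prompt-ensemble pseudo-labelling with CLIP (illustrative only).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

TEMPLATES = ["a photo of a {}.", "a sketch of a {}.", "art of a {}."]

@torch.no_grad()
def class_embeddings(class_names):
    """Average L2-normalized text embeddings over the prompt ensemble."""
    embs = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        feats = model.get_text_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        embs.append(feats.mean(dim=0))
    return torch.stack(embs)  # (num_classes, D)

@torch.no_grad()
def pseudo_label(images, class_names, threshold=0.6):
    """Return (label, confidence) per image; low-confidence samples get -1."""
    text = class_embeddings(class_names)
    inputs = processor(images=images, return_tensors="pt")
    img = model.get_image_features(**inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ text.T).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    labels[conf < threshold] = -1  # left unlabeled for the next self-training round
    return labels, conf
```

In a self-training loop, the confidently pseudo-labelled target images would then be used for further finetuning, which is the general pattern the entry's name points to.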
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUagE PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.