Grounding Visual Representations with Texts for Domain Generalization
- URL: http://arxiv.org/abs/2207.10285v1
- Date: Thu, 21 Jul 2022 03:43:38 GMT
- Title: Grounding Visual Representations with Texts for Domain Generalization
- Authors: Seonwoo Min, Nokyung Park, Siwon Kim, Seunghyun Park, Jinkyu Kim
- Abstract summary: Cross-modality supervision can be successfully used to ground domain-invariant visual representations.
Our proposed method achieves state-of-the-art results and ranks 1st in average performance for five multi-domain datasets.
- Score: 9.554646174100123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reducing the representational discrepancy between source and target domains
is key to maximizing model generalization. In this work, we
advocate for leveraging natural language supervision for the domain
generalization task. We introduce two modules to ground visual representations
with texts containing typical reasoning of humans: (1) Visual and Textual Joint
Embedder and (2) Textual Explanation Generator. The former learns the
image-text joint embedding space where we can ground high-level
class-discriminative information into the model. The latter leverages an
explainable model and generates explanations justifying the rationale behind
its decision. To the best of our knowledge, this is the first work to leverage
the vision-and-language cross-modality approach for the domain generalization
task. Our experiments with a newly created CUB-DG benchmark dataset demonstrate
that cross-modality supervision can be successfully used to ground
domain-invariant visual representations and improve the model generalization.
Furthermore, in the large-scale DomainBed benchmark, our proposed method
achieves state-of-the-art results and ranks 1st in average performance for five
multi-domain datasets. The dataset and code are available at
https://github.com/mswzeus/GVRT.
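The first module, the Visual and Textual Joint Embedder, maps an image and its human-written description into a shared space so that class-discriminative cues in the text can shape the visual features. The sketch below illustrates that idea only; the encoder choices, feature dimensions, and the cosine-distance alignment loss are illustrative assumptions, not the paper's exact formulation (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Toy visual-and-textual joint embedder (illustrative, not the released GVRT code)."""
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # projects image-backbone features
        self.txt_proj = nn.Linear(txt_dim, joint_dim)  # projects text-encoder features

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def grounding_loss(z_img, z_txt):
    """Pull each image embedding toward the embedding of its paired description."""
    return (1.0 - F.cosine_similarity(z_img, z_txt, dim=-1)).mean()

# Random stand-ins for backbone outputs, just to show the shapes involved.
embedder = JointEmbedder()
img_feat = torch.randn(8, 2048)  # e.g. pooled CNN features
txt_feat = torch.randn(8, 768)   # e.g. sentence-encoder features
z_img, z_txt = embedder(img_feat, txt_feat)
loss = grounding_loss(z_img, z_txt)  # added to the usual classification loss during training
```

The second module, the Textual Explanation Generator, would add a caption-style decoder on top of the visual features and train it to reproduce the human rationale; it is omitted here to keep the sketch short.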
Related papers
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study multi-source domain generalization for text classification.
We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
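Multi-source meta-learning of this kind is typically run episodically: the seen domains are split into meta-train and meta-test subsets at each step, so the model is rewarded for features that transfer across domains. The snippet below is a generic sketch of that split; the domain names and sampling scheme are assumptions, not the paper's specific framework.

```python
import random

# Hypothetical source domains for a text classifier; the true target domain stays unseen.
source_domains = ["reviews", "news", "forums", "emails"]

def sample_episode(domains, n_meta_test=1):
    """Randomly split the seen domains into meta-train and meta-test sets for one episode."""
    shuffled = random.sample(domains, len(domains))
    return shuffled[n_meta_test:], shuffled[:n_meta_test]

for step in range(3):
    meta_train, meta_test = sample_episode(source_domains)
    # 1) take gradient steps on batches drawn from the meta-train domains
    # 2) evaluate a meta-objective on the held-out meta-test domain to reward
    #    representations that generalize across domains
    print(step, meta_train, meta_test)
```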
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLMs and can be elicited with proper prompting.
Our results demonstrate the promise of generative VLMs for scaling up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can then be used to adaptively identify and remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
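One plausible reading of "identify and then remove the domain-specific counterpart" is to project the image feature onto a domain direction estimated from language embeddings and subtract that component. The sketch below shows only this rough reading; WIDIn's fine-grained alignment is more involved, and the feature dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def remove_domain_component(img_feat, domain_txt_emb):
    """Subtract the component of an image feature along a text-derived domain direction.

    Rough illustration only; not WIDIn's actual procedure.
    """
    d = F.normalize(domain_txt_emb, dim=-1)           # domain direction from language
    coeff = (img_feat * d).sum(dim=-1, keepdim=True)  # projection onto that direction
    return img_feat - coeff * d                       # remainder kept as the "invariant" part

img_feat = torch.randn(4, 512)   # e.g. CLIP image features
domain_txt = torch.randn(512)    # e.g. embedding of a domain description ("a sketch of ...")
invariant_feat = remove_domain_component(img_feat, domain_txt)
```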
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control [68.14798033899955]
Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content.
However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation?
We investigate this question in the context of autonomous driving, and answer it with a resounding "yes".
arXiv Detail & Related papers (2023-12-05T18:34:12Z)
- A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance [41.793995960478355]
We propose a novel approach for domain generalization that leverages recent advances in large vision-language models.
The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations.
We evaluate our proposed method, dubbed RISE, on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods.
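The regularization described here can be pictured as a distillation term that pulls the student's image embedding toward the frozen teacher's text embedding of the matching description. The following sketch assumes CLIP-style features and a cosine-distance penalty; RISE's exact distance measure and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def language_guided_regularizer(student_img_emb, teacher_txt_emb):
    """Penalize the distance between student image features and teacher text features.

    Illustrative stand-in for the RISE regularizer, not its exact form.
    """
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_txt_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

# Stand-ins: projected student features and frozen CLIP text features of each
# image's description (e.g. "a photo of a dog"); only the student receives gradients.
student_img_emb = torch.randn(16, 512, requires_grad=True)
teacher_txt_emb = torch.randn(16, 512)
reg = language_guided_regularizer(student_img_emb, teacher_txt_emb)
total = reg  # in practice added to the cross-entropy classification loss
```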
arXiv Detail & Related papers (2023-09-21T23:06:19Z)
- TDG: Text-guided Domain Generalization [10.322052096998728]
We develop a Text-guided Domain Generalization (TDG) paradigm for domain generalization.
We first devise an automatic words generation method to extend the description of current domains with novel domain-relevant words.
Then, we embed the generated domain information into the text feature space, by the proposed prompt learning-based text feature generation method.
Finally, we utilize both input image features and generated text features to train a specially designed classifier that generalizes well on unseen target domains.
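The last step, training a classifier on both image features and generated text features, could in its simplest form feed the concatenated features to a shared head. The sketch below shows only that simplest form; TDG's "specially designed classifier", word generation, and prompt learning are not reproduced here, and the feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class JointFeatureClassifier(nn.Module):
    """Classify from concatenated image and text features (simplified illustration)."""
    def __init__(self, img_dim=512, txt_dim=512, num_classes=7):
        super().__init__()
        self.head = nn.Linear(img_dim + txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

img_feat = torch.randn(8, 512)  # backbone features of input images
txt_feat = torch.randn(8, 512)  # prompt-learned features of generated domain words
logits = JointFeatureClassifier()(img_feat, txt_feat)  # trained with standard cross-entropy
```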
arXiv Detail & Related papers (2023-08-19T07:21:02Z)
- Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
Diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
- Semi-supervised Meta-learning with Disentanglement for Domain-generalised Medical Image Segmentation [15.351113774542839]
Generalising models to new data from new centres (termed here domains) remains a challenge.
We propose a novel semi-supervised meta-learning framework with disentanglement.
We show that the proposed method is robust on different segmentation tasks and achieves state-of-the-art generalisation performance on two public benchmarks.
arXiv Detail & Related papers (2021-06-24T19:50:07Z)
- Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation that leverage adversarial learning to unify source and target video representations are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
- Generalizable Model-agnostic Semantic Segmentation via Target-specific Normalization [24.14272032117714]
We propose a novel domain generalization framework for the generalizable semantic segmentation task.
We exploit model-agnostic learning to simulate the domain-shift problem.
Considering the data-distribution discrepancy between seen source and unseen target domains, we develop the target-specific normalization scheme.
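A common way to realize target-specific normalization is to re-estimate the normalization statistics on unlabeled target-domain images before evaluation. The sketch below shows that generic recipe with BatchNorm; the scheme actually proposed in the paper may differ.

```python
import torch
import torch.nn as nn

def recompute_norm_stats(model, target_batches):
    """Re-estimate BatchNorm running statistics on unlabeled target-domain images.

    Generic recipe for target-specific normalization; the paper's scheme may differ.
    """
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()   # forget source-domain statistics
    model.train()                     # BatchNorm updates running stats in train mode
    with torch.no_grad():
        for images in target_batches:
            model(images)             # forward passes only, no parameter updates
    model.eval()
    return model

# Toy backbone and random "target" batches, just to show the mechanics.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
target_batches = [torch.randn(4, 3, 32, 32) for _ in range(5)]
recompute_norm_stats(model, target_batches)
```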
arXiv Detail & Related papers (2020-03-27T09:25:19Z)
- Universal-RCNN: Universal Object Detector via Transferable Graph R-CNN [117.80737222754306]
We present a novel universal object detector called Universal-RCNN.
We first generate a global semantic pool by integrating the high-level semantic representations of all categories.
An Intra-Domain Reasoning Module learns and propagates the sparse graph representation within one dataset guided by a spatial-aware GCN.
arXiv Detail & Related papers (2020-02-18T07:57:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences.