Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
- URL: http://arxiv.org/abs/2403.05535v3
- Date: Thu, 6 Jun 2024 01:44:48 GMT
- Title: Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
- Authors: Tarun Kalluri, Bodhisattwa Prasad Majumder, Manmohan Chandraker,
- Abstract summary: We introduce LaGTran, a framework that guides robust transfer of discriminative knowledge from labeled source to unlabeled target data with domain gaps.
Motivated by our observation that semantically richer text modality has more favorable transfer properties, we devise a transfer mechanism to use a source-trained text-classifier to generate predictions on the target text descriptions.
Our approach driven by language guidance is surprisingly easy and simple, yet significantly outperforms all prior approaches on challenging datasets like GeoNet and DomainNet.
- Score: 69.29778009769862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce LaGTran, a novel framework that utilizes text supervision to guide robust transfer of discriminative knowledge from labeled source to unlabeled target data with domain gaps. While unsupervised adaptation methods have been established to address this problem, they show limitations in handling challenging domain shifts due to their exclusive operation within the pixel-space. Motivated by our observation that semantically richer text modality has more favorable transfer properties, we devise a transfer mechanism to use a source-trained text-classifier to generate predictions on the target text descriptions, and utilize these predictions as supervision for the corresponding images. Our approach driven by language guidance is surprisingly easy and simple, yet significantly outperforms all prior approaches on challenging datasets like GeoNet and DomainNet, validating its extreme effectiveness. To further extend the scope of our study beyond images, we introduce a new benchmark called Ego2Exo to study ego-exo transfer in videos and find that our language-aided approach LaGTran yields significant gains in this highly challenging and non-trivial transfer setting. Code, models, and proposed datasets are publicly available at https://tarun005.github.io/lagtran/.
Related papers
- UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models [75.77651291095565]
We leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models.
Under this unsupervised multi-domain setting, we have identified inherent model bias within CLIP.
We propose Unsupervised Multi-domain Feature (UMFC) to mitigate this model bias.
arXiv Detail & Related papers (2024-11-11T12:25:02Z) - VLLaVO: Mitigating Visual Gap through LLMs [7.352822795984628]
Cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data.
We propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners.
arXiv Detail & Related papers (2024-01-06T16:33:39Z) - Semantic-aware Message Broadcasting for Efficient Unsupervised Domain
Adaptation [40.939984198850496]
We propose a novel method, Semantic-aware Message Broadcasting (SAMB), which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA)
We introduce a group of learned group tokens as nodes to aggregate the global information from all image tokens.
In this way, our message broadcasting encourages the group tokens to learn more informative and diverse information for effective domain alignment.
arXiv Detail & Related papers (2022-12-06T04:09:47Z) - Who are you referring to? Weakly supervised coreference resolution with
multimodal grounding [44.502102006343094]
Coreference resolution aims at identifying words and phrases which refer to same entity in a text.
Most existing image-text datasets contain short sentences without coreferent expressions, or coreferences are not annotated.
We propose a new technique that learns to identify coreference chains through weakly supervised grounding from image-text pairs and a regularization using prior linguistic knowledge.
arXiv Detail & Related papers (2022-11-26T13:33:42Z) - Unsupervised Image-to-Image Translation with Generative Prior [103.54337984566877]
Unsupervised image-to-image translation aims to learn the translation between two visual domains without paired data.
We present a novel framework, Generative Prior-guided UN Image-to-image Translation (GP-UNIT), to improve the overall quality and applicability of the translation algorithm.
arXiv Detail & Related papers (2022-04-07T17:59:23Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z) - Structured Domain Adaptation with Online Relation Regularization for
Unsupervised Person Re-ID [62.90727103061876]
Unsupervised domain adaptation (UDA) aims at adapting the model trained on a labeled source-domain dataset to an unlabeled target-domain dataset.
We propose an end-to-end structured domain adaptation framework with an online relation-consistency regularization term.
Our proposed framework is shown to achieve state-of-the-art performance on multiple UDA tasks of person re-ID.
arXiv Detail & Related papers (2020-03-14T14:45:18Z) - Supervised Domain Adaptation using Graph Embedding [86.3361797111839]
Domain adaptation methods assume that distributions between the two domains are shifted and attempt to realign them.
We propose a generic framework based on graph embedding.
We show that the proposed approach leads to a powerful Domain Adaptation framework.
arXiv Detail & Related papers (2020-03-09T12:25:13Z) - Learning to adapt class-specific features across domains for semantic
segmentation [36.36210909649728]
In this thesis, we present a novel architecture, which learns to adapt features across domains by taking into account per class information.
We adopt the recently introduced StarGAN architecture as image translation backbone, since it is able to perform translations across multiple domains by means of a single generator network.
arXiv Detail & Related papers (2020-01-22T23:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.