TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation
- URL: http://arxiv.org/abs/2508.06452v1
- Date: Fri, 08 Aug 2025 16:51:44 GMT
- Title: TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation
- Authors: Mattia Litrico, Mario Valerio Giuffrida, Sebastiano Battiato, Devis Tuia
- Abstract summary: We introduce a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. We propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces. Our approach outperforms previous methods, setting the new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts.
- Score: 9.906359339999039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g., geographical shifts), where both the background and object appearances differ significantly across domains. Prior work has shown that the language modality can help in the adaptation process, as it is more robust to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. The estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of wrong pseudo-labels obtained from low-quality captions. To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces by leveraging captions to guide the contrastive training of the vision model on target images. In this contrastive loss, each pair of images acts as both a positive and a negative pair, and their feature representations are attracted and repulsed with a strength proportional to the similarity of their captions. This avoids the need to make hard positive/negative pair assignments, which is error-prone in the UDA setting. Our approach outperforms previous methods, setting a new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.
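The abstract describes the two target-domain losses concretely enough to sketch. Below is a minimal, hypothetical PyTorch rendering: the function names, the softmax normalisation of CLIP similarity scores, the use of the maximum probability as confidence, and the sigmoid form of the soft-contrastive term are assumptions inferred from the abstract, not the paper's actual implementation.

```python
# Hypothetical sketch of TRUST's two target-domain losses, inferred from the
# abstract. All formulas are assumptions (softmax-normalised similarities,
# confidence = max probability, sigmoid soft-contrastive form); the paper's
# exact definitions may differ.
import torch
import torch.nn.functional as F

def caption_pseudo_labels(caption_emb, class_text_emb, tau=0.01):
    """Pseudo-label target samples from their captions.

    caption_emb:    (B, D) CLIP embeddings of generated captions.
    class_text_emb: (C, D) CLIP embeddings of class-name prompts.
    Returns hard pseudo-labels plus a confidence weight derived from
    normalised CLIP similarity scores (uncertainty ~ 1 - confidence).
    """
    sims = F.normalize(caption_emb, dim=-1) @ F.normalize(class_text_emb, dim=-1).t()
    probs = (sims / tau).softmax(dim=-1)   # normalised similarity scores
    conf, pseudo = probs.max(dim=-1)
    return pseudo, conf.detach()           # no gradient through the weights

def reweighted_ce(logits, pseudo, conf):
    """Classification loss reweighted by pseudo-label confidence, damping
    the effect of wrong labels obtained from low-quality captions."""
    return (conf * F.cross_entropy(logits, pseudo, reduction="none")).mean()

def soft_contrastive(img_feat, caption_emb, tau=0.1):
    """Multimodal soft-contrastive loss: every image pair acts as both a
    positive and a negative, attracted/repulsed in proportion to the
    similarity of the two captions."""
    z = F.normalize(img_feat, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    w = (c @ c.t() + 1) / 2                     # caption similarity in [0, 1]
    p = torch.sigmoid(z @ z.t() / tau)          # image-pair agreement in (0, 1)
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    attract = -w * torch.log(p + 1e-8)          # pull together (positive role)
    repel = -(1 - w) * torch.log(1 - p + 1e-8)  # push apart (negative role)
    return (attract + repel)[off_diag].mean()
```

In this reading, `conf` plays the role of the inverse uncertainty estimate: a caption whose normalised similarity scores are spread across classes yields a low weight, so a wrong pseudo-label contributes little to the classification loss.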
Related papers
- A Meaningful Perturbation Metric for Evaluating Explainability Methods [55.09730499143998]
We introduce a novel approach, which harnesses image generation models to perform targeted perturbation. Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity. This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results.
arXiv Detail & Related papers (2025-04-09T11:46:41Z)
- Rethinking Weak-to-Strong Augmentation in Source-Free Domain Adaptive Object Detection [38.596886094105216]
Source-Free domain adaptive Object Detection (SFOD) aims to transfer a detector (pre-trained on a source domain) to new unlabelled target domains.
This paper introduces a novel Weak-to-Strong Contrastive Learning (WSCoL) approach.
arXiv Detail & Related papers (2024-10-07T23:32:06Z)
- Domain Adaptive Object Detection via Balancing Between Self-Training and Adversarial Learning [19.81071116581342]
Deep-learning-based object detectors struggle to generalize to new target domains with significant variations in object appearance and background.
Current methods align domains using image-level or instance-level adversarial feature alignment.
We propose to leverage the model's predictive uncertainty to strike the right balance between adversarial feature alignment and class-level alignment.
arXiv Detail & Related papers (2023-11-08T16:40:53Z)
- Counterfactual Image Generation for adversarially robust and interpretable Classifiers [1.3859669037499769]
We propose a unified framework leveraging image-to-image translation Generative Adversarial Networks (GANs) to produce counterfactual samples.
This is achieved by combining the classifier and discriminator into a single model that attributes real images to their respective classes and flags generated images as "fake".
We show how the model exhibits improved robustness to adversarial attacks, and we show how the discriminator's "fakeness" value serves as an uncertainty measure of the predictions.
arXiv Detail & Related papers (2023-10-01T18:50:29Z)
- Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain.
With only a single representative text feature instead of real images, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
- Adaptive Face Recognition Using Adversarial Information Network [57.29464116557734]
Face recognition models often degrade when training data differ from testing data.
We propose a novel adversarial information network (AIN) to address this issue.
arXiv Detail & Related papers (2023-05-23T02:14:11Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples (a minimal label-smoothing sketch follows this list).
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Selective Pseudo-Labeling with Reinforcement Learning for Semi-Supervised Domain Adaptation [116.48885692054724]
We propose a reinforcement learning based selective pseudo-labeling method for semi-supervised domain adaptation.
We develop a deep Q-learning model to select both accurate and representative pseudo-labeled instances.
Our proposed method is evaluated on several SSDA benchmark datasets and demonstrates performance superior to all comparison methods.
arXiv Detail & Related papers (2020-12-07T03:37:38Z)
- Learning from Scale-Invariant Examples for Domain Adaptation in Semantic Segmentation [6.320141734801679]
We propose a novel approach that exploits the scale-invariance property of semantic segmentation models for self-supervised domain adaptation.
Our algorithm is based on the reasonable assumption that, in general, semantic labeling should be unchanged regardless of the scale of objects and stuff (given the context).
We show that this constraint is violated on images of the target domain and can hence be used to transfer labels between differently scaled patches (see the scale-consistency sketch after this list).
arXiv Detail & Related papers (2020-07-28T19:40:45Z)
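As noted in the label-smoothing entry above, here is a minimal sketch of the canonical formulation; the paper evaluates several smoothing strategies, which may differ from this one.

```python
# Canonical label smoothing: train against the softened target
# (1 - alpha) * one_hot(y) + alpha / K, which caps model confidence and,
# per the paper's findings, reduces over-confident errors on adversarial
# examples. Standard formulation, not necessarily the paper's exact variant.
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, alpha=0.1):
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)  # cross-entropy against uniform labels
    return ((1 - alpha) * nll + alpha * uniform).mean()

# Recent PyTorch exposes the same thing directly:
# F.cross_entropy(logits, targets, label_smoothing=0.1)
```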
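And as referenced in the scale-invariance entry, one plausible way to express a scale-consistency constraint for segmentation is sketched below; the KL-divergence loss, the single down-scaling factor, and the stop-gradient on the full-resolution branch are assumptions, not the paper's implementation.

```python
# Hypothetical scale-consistency check: a model's predictions on a
# downscaled image should match its downscaled predictions on the original.
# Large divergence flags target-domain pixels whose labels are unreliable.
import torch
import torch.nn.functional as F

def scale_consistency(model, image, scale=0.5):
    with torch.no_grad():  # treat full-resolution predictions as reference
        ref = F.interpolate(model(image), scale_factor=scale,
                            mode="bilinear", align_corners=False)
    small = F.interpolate(image, scale_factor=scale,
                          mode="bilinear", align_corners=False)
    pred = model(small)  # (B, C, h, w) logits
    return F.kl_div(pred.log_softmax(dim=1), ref.softmax(dim=1),
                    reduction="batchmean")
```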
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.