Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation
- URL: http://arxiv.org/abs/2506.11493v1
- Date: Fri, 13 Jun 2025 06:33:27 GMT
- Title: Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation
- Authors: Tung-Long Vuong, Hoang Phan, Vy Vo, Anh Bui, Thanh-Toan Do, Trung Le, Dinh Phung
- Abstract summary: This work introduces a fresh solution to reinforce base pseudo-labels and facilitate target-prompt learning. We first propose to leverage reference predictions based on the relationship between source and target visual embeddings. We then show that a strong clustering behavior is observed between visual and text embeddings in pre-trained multi-modal models.
- Score: 29.809079908218607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains can deviate from the visual embedding distribution in the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution to reinforce these pseudo-labels and facilitate target-prompt learning, by exploiting the geometry of visual and text embeddings - an aspect that is overlooked by existing methods. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models. Building on optimal transport theory, we transform this insight into a novel strategy to enforce the clustering property in text embeddings, further enhancing the alignment in the target domain. Our experiments and ablation studies validate the effectiveness of the proposed approach, demonstrating superior performance and improved quality of target prompts in terms of representation.
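The two mechanisms the abstract combines - CLIP zero-shot predictions as base pseudo-labels, then an optimal-transport step that enforces a clustering prior on the assignments - can be sketched in a few lines. This is a minimal sketch under my own assumptions: the function names, the temperature value, and plain Sinkhorn-Knopp as the OT solver are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_pseudo_labels(image_feats, text_feats, temperature=0.01):
    # Base pseudo-labels: cosine similarity between L2-normalized target
    # visual embeddings and class-text embeddings (CLIP zero-shot inference).
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return (image_feats @ text_feats.T / temperature).softmax(dim=-1)  # (N, C)

@torch.no_grad()
def sinkhorn_refine(probs, n_iters=3):
    # Sinkhorn-Knopp iterations, the standard entropic-OT recipe: alternately
    # normalize columns (class marginals) and rows (per-sample distributions)
    # so the refined assignments respect a balanced clustering prior.
    q = probs.clone()
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)  # spread mass evenly across classes
        q = q / q.sum(dim=1, keepdim=True)  # keep each row a distribution
    return q

# Usage: refined = sinkhorn_refine(zero_shot_pseudo_labels(v_target, t_class))
```

The balancing step is one concrete way to encode the clustering property the abstract mentions; the paper's actual OT formulation over text embeddings may be more elaborate.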
Related papers
- OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP [15.780915391081734]
Low-Shot Open-Set Domain Generalization (LSOSDG) is a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). We propose OSLOPROMPT, an advanced prompt-learning framework for CLIP with two core innovations.
arXiv Detail & Related papers (2025-03-20T12:51:19Z)
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
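Reading only this summary, one plausible rendering of the idea - q-q and k-k correlation maps from intermediate layers remolding the final block's attention - is sketched below; the layer selection, scaling, and residual weight `alpha` are my assumptions, not the paper's design.

```python
import torch.nn.functional as F

def remold_final_attention(attn_final, qs, ks, scale, alpha=0.5):
    # Average q-q and k-k cross-correlation maps collected from intermediate
    # layers, then blend them residually with the final block's attention.
    # qs, ks: lists of (tokens, dim) query/key tensors; attn_final: (tokens, tokens).
    corr = sum(q @ q.transpose(-2, -1) + k @ k.transpose(-2, -1)
               for q, k in zip(qs, ks)) / len(qs)
    corr = F.softmax(scale * corr, dim=-1)
    return alpha * corr + (1.0 - alpha) * attn_final
```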
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
- Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks [42.18755809782401]
We propose a novel transfer attack method called PDCL-Attack. We formulate effective prompt-driven feature guidance by harnessing the semantic representation power of text.
arXiv Detail & Related papers (2024-07-30T08:52:16Z)
- Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods only assess their adapted models on the target training set, neglecting data from unseen but identically distributed test sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
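The summary does not spell out the regularizer, but the generic form of consistency regularization - matching predictions across weak and strong augmentations of the same unlabelled target image - is easy to sketch. The names and the KL form are assumptions; the paper's framework may differ.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong):
    # The frozen soft prediction on a weakly augmented view serves as the
    # target for the prediction on a strongly augmented view.
    with torch.no_grad():
        p_weak = F.softmax(model(x_weak), dim=-1)
    log_p_strong = F.log_softmax(model(x_strong), dim=-1)
    return F.kl_div(log_p_strong, p_weak, reduction="batchmean")
```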
arXiv Detail & Related papers (2023-08-03T07:45:53Z)
- Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
Diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
- Deep face recognition with clustering based domain adaptation [57.29464116557734]
We propose a new clustering-based domain adaptation method designed for face recognition task in which the source and target domain do not share any classes.
Our method effectively learns discriminative target features by aligning the feature domains globally while, at the same time, distinguishing the target clusters locally.
arXiv Detail & Related papers (2022-05-27T12:29:11Z)
- Learning Rich Nearest Neighbor Representations from Self-supervised Ensembles [60.97922557957857]
We provide a framework to perform self-supervised model ensembling via a novel method of learning representations directly through gradient descent at inference time.
This technique improves representation quality, as measured by k-nearest neighbors, both on the in-domain dataset and in the transfer setting.
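The k-nearest-neighbor measurement referred to here is the standard representation probe; a minimal version is sketched below, with cosine similarity, majority voting, and the value of k all assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    # Classify each test feature by a majority vote over its k nearest
    # training features under cosine similarity, then score accuracy.
    train_feats = F.normalize(train_feats, dim=-1)
    test_feats = F.normalize(test_feats, dim=-1)
    sims = test_feats @ train_feats.T            # (n_test, n_train)
    nn_idx = sims.topk(k, dim=-1).indices        # (n_test, k)
    preds = train_labels[nn_idx].mode(dim=-1).values
    return (preds == test_labels).float().mean().item()
```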
arXiv Detail & Related papers (2021-10-19T22:24:57Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent images and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
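In its generic form, cross-modality region contrastive learning is a symmetric InfoNCE over matched region/phrase pairs. The sketch below assumes row i of each input is a matched pair; the paper's pair construction is likely more involved.

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(region_feats, phrase_feats, temperature=0.07):
    # Symmetric InfoNCE: each region should be closest to its paired phrase
    # and vice versa, with other rows in the batch acting as negatives.
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    logits = r @ p.T / temperature
    targets = torch.arange(len(r), device=r.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```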
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation that leverage adversarial learning to unify source and target video representations are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
- Self-Supervised Prototypical Transfer Learning for Few-Shot Classification [11.96734018295146]
The self-supervised transfer learning approach ProtoTransfer outperforms state-of-the-art unsupervised meta-learning methods on few-shot tasks.
In few-shot experiments with domain shift, our approach even has comparable performance to supervised methods, but requires orders of magnitude fewer labels.
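The prototypical classification at the core of this family of methods is compact enough to sketch; the Euclidean metric and the assumption that every class appears in the support set are mine, not necessarily the paper's.

```python
import torch

@torch.no_grad()
def prototype_predict(support_feats, support_labels, query_feats, n_classes):
    # Class prototypes are mean support embeddings; each query is assigned
    # to the nearest prototype. Assumes every class has support examples.
    protos = torch.stack([support_feats[support_labels == c].mean(dim=0)
                          for c in range(n_classes)])   # (C, D)
    return torch.cdist(query_feats, protos).argmin(dim=-1)
```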
arXiv Detail & Related papers (2020-06-19T19:00:11Z)