Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
- URL: http://arxiv.org/abs/2508.03102v1
- Date: Tue, 05 Aug 2025 05:30:42 GMT
- Title: Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
- Authors: Tianjiao Jiang, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
- Abstract summary: Causal CLIP Adapter (CCA) is a novel framework that explicitly disentangles visual features extracted from CLIP. Our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts.
- Score: 11.752632557524969
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Few-shot learning (FSL) often requires effective adaptation of models using limited labeled data. However, most existing FSL methods rely on entangled representations, requiring the model to implicitly recover the unmixing process to obtain disentangled representations using only limited supervision, which hinders effective adaptation. Recent theoretical studies show that multimodal contrastive learning methods, such as CLIP, can disentangle latent representations up to linear transformations. In light of this, we propose the Causal CLIP Adapter (CCA), a novel framework that explicitly disentangles visual features extracted from CLIP using unsupervised Independent Component Analysis (ICA). This removes the need to learn the unmixing process from the labeled data, thereby reducing the number of trainable parameters and mitigating overfitting. Taking a step further, while ICA can obtain visual disentangled representations, it may also disrupt CLIP's intra- and inter-modal alignment. To counteract this, CCA further leverages CLIP's inherent cross-modal alignment by enhancing it in two ways: unidirectionally, through fine-tuning a CLIP-based text classifier, and bidirectionally, via a cross-attention mechanism that enriches visual and textual representations through mutual interaction. Both unimodal and cross-modal classification outputs can be effectively combined linearly to improve classification accuracy. Extensive experiments on 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts, while maintaining computational efficiency. Code will be available at https://github.com/tianjiao-j/CCA.
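Two ingredients of the abstract lend themselves to a compact illustration: unsupervised ICA applied to frozen CLIP visual features, and a linear fusion of unimodal and cross-modal classification outputs. The following is a minimal sketch under stated assumptions, not the authors' implementation (see the linked repository for that): scikit-learn's FastICA stands in for the ICA step; the 16-shot/11-class shapes, the latent dimension `K`, the projection layer `proj`, and the mixing weight `alpha` are hypothetical choices; and the paper's bidirectional cross-attention branch is omitted.

```python
# Minimal sketch of the CCA recipe described in the abstract, assuming
# precomputed CLIP features. FastICA stands in for the unsupervised ICA
# step; K, `proj`, and `alpha` are hypothetical choices.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.decomposition import FastICA

N, C, D, K = 176, 11, 512, 64   # 16 shots x 11 classes, CLIP feature dim, ICA components
image_feats = np.random.randn(N, D).astype(np.float32)  # placeholder CLIP image features
text_feats = torch.randn(C, D)                          # placeholder CLIP class-name embeddings

# 1) Unsupervised disentanglement: ICA unmixes the entangled visual
#    features without using the few-shot labels.
ica = FastICA(n_components=K, whiten="unit-variance", random_state=0)
z = torch.from_numpy(ica.fit_transform(image_feats).astype(np.float32))  # (N, K)

# 2) Unimodal branch: a lightweight linear classifier on the disentangled
#    features, trained on the few labeled shots.
unimodal_head = nn.Linear(K, C)

# 3) Cross-modal branch: cosine similarity against a (fine-tunable) CLIP
#    text classifier, after projecting back toward the CLIP embedding space.
proj = nn.Linear(K, D, bias=False)

def cross_modal_logits(z: torch.Tensor) -> torch.Tensor:
    v = F.normalize(proj(z), dim=-1)      # (N, D)
    t = F.normalize(text_feats, dim=-1)   # (C, D)
    return v @ t.T                        # (N, C)

# 4) Linear combination of unimodal and cross-modal outputs, as described
#    in the abstract; alpha is an assumed mixing coefficient.
alpha = 0.5
logits = alpha * unimodal_head(z) + (1 - alpha) * cross_modal_logits(z)
print(logits.shape)  # torch.Size([176, 11])
```

In a real pipeline the unmixing would be fit once on the support-set features and reused at test time via `ica.transform`, and the two heads would be trained with cross-entropy on the few labeled shots.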
Related papers
- Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score [11.74414842618874]
We show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78%.
arXiv Detail & Related papers (2025-07-13T12:38:38Z)
- Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation [4.063715077687089]
Distill CLIP (DCLIP) is a fine-tuned variant of the CLIP model. It enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities.
arXiv Detail & Related papers (2025-05-25T07:08:07Z)
- Cross-Modal Consistency Learning for Sign Language Recognition [92.44927164283641]
Existing pre-training methods focus solely on compact pose data. We propose a Cross-modal Consistency Learning framework (CCL-SLR), which learns from both RGB and pose modalities based on self-supervised pre-training.
arXiv Detail & Related papers (2025-03-16T12:34:07Z)
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations. SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
- Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
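The null-space idea admits a short illustration: project each prompt gradient onto directions that previous tasks' features barely occupy, so updates for the new task leave old-task representations approximately unchanged. Below is a minimal sketch assuming an exact eigendecomposition with a hypothetical threshold `eps`; the paper itself proposes an effective approximation rather than this exact form.

```python
# Hedged sketch of null-space gradient projection: keep prompt updates in
# directions that previous tasks' features barely occupy. `eps` and the
# shapes are illustrative assumptions, not the paper's exact procedure.
import torch

def null_space_projection(prev_feats: torch.Tensor, grad: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Project `grad` onto the (approximate) null space of `prev_feats` (N x D)."""
    cov = prev_feats.T @ prev_feats / prev_feats.shape[0]  # (D, D) uncentered covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)              # eigenvalues in ascending order
    null_basis = eigvecs[:, eigvals < eps]                 # directions old features barely occupy
    return null_basis @ (null_basis.T @ grad)              # projected gradient, shape (D,)

# Usage: synthetic rank-deficient old-task features, so a non-trivial null space exists.
prev_feats = torch.randn(1000, 200) @ torch.randn(200, 768)
grad = torch.randn(768)
g_proj = null_space_projection(prev_feats, grad)           # apply before the optimizer step
```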
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
- Domain Aligned CLIP for Few-shot Classification [3.5326413171911555]
Domain Aligned CLIP (DAC) improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model.
We study the effectiveness of DAC by benchmarking on 11 widely used image classification tasks, observing consistent improvements of about 2.3% over strong baselines in 16-shot classification.
arXiv Detail & Related papers (2023-11-15T18:34:26Z)
- Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP on downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Weakly supervised segmentation with cross-modality equivariant constraints [7.757293476741071]
Weakly supervised learning has emerged as an appealing alternative to alleviate the need for large labeled datasets in semantic segmentation.
We present a novel learning strategy that leverages self-supervision in a multi-modal image scenario to significantly enhance original CAMs.
Our approach outperforms relevant recent literature under the same learning conditions.
arXiv Detail & Related papers (2021-04-06T13:14:20Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)