Enhancing CLIP Robustness via Cross-Modality Alignment
- URL: http://arxiv.org/abs/2510.24038v1
- Date: Tue, 28 Oct 2025 03:47:44 GMT
- Title: Enhancing CLIP Robustness via Cross-Modality Alignment
- Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang
- Abstract summary: We propose Cross-modality Alignment (COLA), an optimal transport-based framework for vision-language models. COLA restores global image-text alignment and local structural consistency in the feature space. COLA is training-free and compatible with existing fine-tuned models.
- Score: 54.01929554563447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gap in CLIP's encoded feature space, where text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
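The abstract names two concrete steps: projecting image embeddings onto the subspace spanned by the class text features, then running entropic optimal transport between augmented image views and augmented text prompts. The sketch below is a minimal illustration of that pipeline under our own assumptions (unit-normalized features, uniform marginals over views, a plain Sinkhorn solver, and arbitrary hyperparameters); it is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def project_onto_text_subspace(img_feats, text_feats):
    """Step (1): project image embeddings onto the subspace spanned by the
    K class text features, filtering out non-semantic distortions.

    img_feats:  (N, d) embeddings of N augmented views of one image
    text_feats: (K, d) class text embeddings (rows assumed independent)
    """
    T = text_feats
    # P = T^T (T T^T)^{-1} T projects onto the row space of T
    gram_inv = torch.linalg.inv(T @ T.T)        # (K, K)
    return img_feats @ T.T @ gram_inv @ T       # (N, d)

def sinkhorn(cost, eps=0.05, n_iters=50):
    """Entropy-regularized OT between uniform distributions over views."""
    n, m = cost.shape
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]          # transport plan, (n, m)

def cola_scores(img_views, text_views, class_text_feats):
    """Step (2): OT-refined class scores (hypothetical helper names).

    img_views:        (N, d) image-view embeddings
    text_views:       (K, M, d) M augmented prompts per class
    class_text_feats: (K, d) one canonical embedding per class
    """
    proj = F.normalize(
        project_onto_text_subspace(img_views, class_text_feats), dim=-1)
    scores = []
    for k in range(text_views.shape[0]):
        t = F.normalize(text_views[k], dim=-1)  # (M, d)
        sim = proj @ t.T                        # cosine similarity, (N, M)
        plan = sinkhorn(1.0 - sim)              # projection enters the cost
        scores.append((plan * sim).sum())       # transport-weighted similarity
    return torch.stack(scores)                  # (K,) class scores
```

The argmax over the returned scores gives the predicted class; because the projection discards components orthogonal to the class text subspace, perturbations that live outside that subspace do not affect the cost.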
Related papers
- Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models [59.242742594156546]
CoEvo is a test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies.
CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
arXiv Detail & Related papers (2026-01-13T12:08:26Z)
- BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models [7.174865411448373]
We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation.
On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce prediction consistency between causal and spurious regions.
On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space.
arXiv Detail & Related papers (2026-01-05T14:22:20Z)
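A minimal sketch of a learnable re-centering step in the spirit of the balanced prompt normalization described in the entry above; the single learnable center `mu` and the unit-sphere renormalization are our assumptions, not BiPrompt's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedPromptNorm(nn.Module):
    """Re-centers class text embeddings with a learnable offset so the class
    directions spread more isotropically around the origin (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))  # learnable semantic center

    def forward(self, class_embeds):              # (K, d) CLIP text features
        centered = class_embeds - self.mu         # remove the shared bias
        return F.normalize(centered, dim=-1)      # back to the unit sphere
```

In practice `mu` could be initialized to the mean of the class embeddings, so the first forward pass already removes the dominant shared bias direction.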
- Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models [31.920092341939593]
Self-Calibrated Consistency is an effective test-time defense against adversarial attacks.
SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy.
These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP.
arXiv Detail & Related papers (2025-10-26T18:37:12Z)
- Modest-Align: Data-Efficient Alignment for Vision-Language Models [67.48633659305592]
Cross-modal alignment models often suffer from overconfidence and degraded performance when operating in resource-constrained settings.
We propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency.
Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
arXiv Detail & Related papers (2025-10-24T16:11:10Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively.
It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
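One plausible way to "decouple" a self-attention block into content and context branches is sketched below; the query-query versus query-key split is an illustrative assumption, not necessarily DeCLIP's published design.

```python
import torch
import torch.nn.functional as F

def decoupled_attention(q, k, v, scale):
    """q, k, v: (L, d) token projections from one CLIP attention block."""
    # "Content" branch: query-query correlation keeps each token attending
    # to semantically similar tokens (sharper, object-centric features).
    content = F.softmax(q @ q.T * scale, dim=-1) @ v
    # "Context" branch: standard query-key attention mixes in surrounding
    # scene information for each token.
    context = F.softmax(q @ k.T * scale, dim=-1) @ v
    return content, context
```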
- TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation [9.906359339999039]
We introduce a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model.
We propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces.
Our approach outperforms previous methods, setting the new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts.
arXiv Detail & Related papers (2025-08-08T16:51:44Z)
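A minimal sketch of a multimodal soft-contrastive loss in the spirit of the one named above: instead of one-hot InfoNCE targets, each image distributes its target mass over texts according to text-text similarity. The temperatures and the target construction are assumptions, not TRUST's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(img_feats, txt_feats, tau=0.07, tau_t=0.07):
    """img_feats, txt_feats: (B, d), unit-normalized, pairwise matched."""
    logits = img_feats @ txt_feats.T / tau            # (B, B) image-to-text
    # Soft targets: weight each text by its similarity to the paired text,
    # softening the usual hard one-hot assignment.
    with torch.no_grad():
        targets = F.softmax(txt_feats @ txt_feats.T / tau_t, dim=-1)
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

As `tau_t` tends to zero the targets collapse to one-hot pairs and the loss reduces to standard InfoNCE, which is the sense in which the objective is "soft".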
- Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning [11.752632557524969]
Causal CLIP Adapter (CCA) is a novel framework that explicitly disentangles visual features extracted from CLIP.
Our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts.
arXiv Detail & Related papers (2025-08-05T05:30:42Z)
- Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score [11.74414842618874]
We show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels.
We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings.
Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78%.
arXiv Detail & Related papers (2025-07-13T12:38:38Z)
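As a rough illustration of what aligning localized image features with descriptive language embeddings can look like, here is a minimal patch-to-token matching score; the max-over-tokens pooling and the input shapes are our assumptions, not FAIR's actual alignment score.

```python
import torch

def fine_grained_alignment(patch_feats, token_feats):
    """patch_feats: (P, d) local image features; token_feats: (T, d) text
    token embeddings; both assumed unit-normalized. Returns a scalar score."""
    sim = patch_feats @ token_feats.T        # (P, T) patch-token similarity
    return sim.max(dim=-1).values.mean()     # best token per patch, averaged
```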
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.
It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL).
It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
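MMD-LoRA builds on the standard LoRA mechanism: a frozen pretrained linear layer augmented with a trainable low-rank update. The sketch below shows that base mechanism only (PDDA and VTCCL are not shown); the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank residual B @ A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W x + scale * (B A) x: only A and B are updated during adaptation,
        # and B starts at zero so training begins from the pretrained model.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```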
- Domain Aligned CLIP for Few-shot Classification [3.5326413171911555]
Domain Aligned CLIP (DAC) improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model.
We study the effectiveness of DAC by benchmarking on 11 widely used image classification tasks, with consistent improvements of about 2.3% over strong baselines in 16-shot classification.
arXiv Detail & Related papers (2023-11-15T18:34:26Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of semi-supervised learning (SSL) and domain adaptation (DA).
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO aligns the similarity scores by considering the discrepancy between an image and its neighbors.
IDA-SSE provides convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.