Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models
- URL: http://arxiv.org/abs/2510.22785v1
- Date: Sun, 26 Oct 2025 18:37:12 GMT
- Title: Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models
- Authors: Jiaxiang Liu, Jiawei Du, Xiao Liu, Prayag Tiwari, Mingkun Xu,
- Abstract summary: Self-Calibrated Consistency (SCC) is an effective test-time defense against adversarial attacks. SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy. These findings highlight the potential of establishing an adversarially robust paradigm from CLIP.
- Score: 31.920092341939593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks -- lack of semantic guidance and vulnerability to view variations -- collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the target embedding from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations. Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader vision-language domains such as BioMedCLIP.
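The spatial-consistency module described above stabilizes inference by agreeing predictions across augmented views. A minimal sketch of that idea, assuming per-view class logits have already been produced by a CLIP-style model; the logit values and helper names here are illustrative, not from the paper:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def consistent_prediction(view_logits):
    """Average per-view class probabilities (one logit list per augmented
    view) and return (predicted_class, averaged_probs)."""
    probs = [softmax(v) for v in view_logits]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# Three augmented views of the same (possibly perturbed) image:
# two views agree on class 0, one is fooled toward class 1.
views = [[2.0, 0.5, 0.1], [1.8, 0.7, 0.2], [0.4, 1.5, 0.3]]
pred, avg = consistent_prediction(views)
```

Averaging in probability space (rather than taking a majority vote) lets confident views outweigh the single fooled one, which is the intuition behind aligning perturbed predictions via augmented views.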
Related papers
- Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction [67.45032003041399]
We propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive, semantically guided perturbations. SADCA establishes a contrastive learning mechanism involving adversarial, positive, and negative samples to reinforce the semantic inconsistency of the obtained perturbations. Experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods.
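The contrastive mechanism named above can be pictured as an InfoNCE-style objective over adversarial, positive, and negative embeddings, which the attacker would push upward. A hedged sketch under that assumption (the function names and embeddings are illustrative; the paper's exact loss may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_attack_loss(adv, positive, negatives, tau=0.1):
    """InfoNCE-style objective: low when the adversarial embedding still
    matches the positive (true semantics); an attack perturbs the image
    to *increase* this value, pushing it toward the negatives."""
    pos = math.exp(cosine(adv, positive) / tau)
    negs = sum(math.exp(cosine(adv, n) / tau) for n in negatives)
    return -math.log(pos / (pos + negs))

# An embedding still aligned with the positive vs. one pushed to a negative.
aligned = contrastive_attack_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
shifted = contrastive_attack_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
```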
arXiv Detail & Related papers (2026-03-05T05:46:16Z)
- GuardFed: A Trustworthy Federated Learning Framework Against Dual-Facet Attacks [56.983319121358555]
Federated learning (FL) enables privacy-preserving collaborative model training but remains vulnerable to adversarial behaviors. We introduce the Dual-Facet Attack (DFA), a novel threat model that concurrently undermines predictive accuracy and group fairness. We propose GuardFed, a self-adaptive defense framework that maintains a fairness-aware reference model using a small amount of clean server data.
arXiv Detail & Related papers (2025-11-12T13:02:45Z)
- Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference [45.723695657400576]
We argue that enhancing the diversity and coverage of counterattacks is crucial to improving adversarial robustness in test-time defense. We propose Directional Orthogonal Counterattack (DOC), which augments counterattack optimization by incorporating orthogonal gradient directions and momentum-based updates. We present a directional sensitivity score based on averaged cosine similarity to boost DOC by improving example discrimination and adaptively modulating the counterattack strength.
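Orthogonal exploration and a cosine-similarity score are both standard constructions, so they can be sketched concretely. This is a minimal interpretation under stated assumptions (Gram-Schmidt for orthogonality, average pairwise cosine for the score); the helper names are hypothetical, not DOC's actual API:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def orthogonal_directions(base, candidates):
    """Gram-Schmidt: project each candidate off the directions kept so far,
    yielding mutually orthogonal counterattack directions to explore."""
    kept = [normalize(base)]
    for c in candidates:
        r = list(c)
        for d in kept:
            dot = sum(a * b for a, b in zip(r, d))
            r = [a - dot * b for a, b in zip(r, d)]
        if any(abs(x) > 1e-9 for x in r):
            kept.append(normalize(r))
    return kept

def sensitivity_score(directions):
    """Average pairwise cosine similarity of unit-norm directions; higher
    agreement suggests a stronger counterattack step is safe."""
    sims, k = [], len(directions)
    for i in range(k):
        for j in range(i + 1, k):
            sims.append(sum(a * b for a, b in zip(directions[i], directions[j])))
    return sum(sims) / len(sims) if sims else 1.0

dirs = orthogonal_directions([1.0, 0.0, 0.0], [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```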
arXiv Detail & Related papers (2025-11-12T07:40:16Z)
- Enhancing CLIP Robustness via Cross-Modality Alignment [54.01929554563447]
We propose Cross-modality Alignment (COLA), an optimal transport-based framework for vision-language models. COLA restores global image-text alignment and local structural consistency in the feature space. COLA is training-free and compatible with existing fine-tuned models.
arXiv Detail & Related papers (2025-10-28T03:47:44Z)
- Harnessing Consistency for Robust Test-Time LLM Ensemble [88.55393815158608]
CoRE is a plug-and-play technique that harnesses model consistency for robust LLM ensembling. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens. Model-level consistency models global agreement by promoting model outputs with high self-confidence.
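A low-pass filter over per-token confidence can be realized as a simple moving average, so that an isolated confident token inside an uncertain stretch is damped. A sketch under that assumption (the window size and weighting scheme here are illustrative, not CoRE's published settings):

```python
def low_pass(confidences, window=3):
    """Moving-average smoothing of per-token confidence; isolated spikes
    are damped so that uncertain neighbourhoods get low weight."""
    half = window // 2
    out = []
    for i in range(len(confidences)):
        lo, hi = max(0, i - half), min(len(confidences), i + half + 1)
        out.append(sum(confidences[lo:hi]) / (hi - lo))
    return out

def consistency_weights(confidences, window=3):
    """Normalize the smoothed confidences into per-token ensemble weights."""
    smoothed = low_pass(confidences, window)
    total = sum(smoothed)
    return [s / total for s in smoothed]

smoothed = low_pass([1.0, 0.0, 1.0])
weights = consistency_weights([0.9, 0.1, 0.9, 0.9])
```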
arXiv Detail & Related papers (2025-10-12T04:18:45Z)
- Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting [1.5268922363885407]
We propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency.
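The first component, KL divergence scaled by example uncertainty, is easy to make concrete. In this sketch the entropy of the adversarial prediction is an *assumed* uncertainty proxy; the summary does not specify CAW's exact weighting:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def caw_loss(clean_probs, adv_probs):
    """Scale the clean-vs-adversarial KL term by the entropy of the
    adversarial prediction, so uncertain examples dominate the loss.
    Entropy as the uncertainty proxy is an assumption of this sketch."""
    return entropy(adv_probs) * kl(clean_probs, adv_probs)

clean = [0.7, 0.2, 0.1]
uncertain = caw_loss(clean, [1 / 3, 1 / 3, 1 / 3])   # high-entropy prediction
confident = caw_loss(clean, [0.9, 0.05, 0.05])       # low-entropy prediction
```

The comparison shows the intended behavior: the same clean distribution incurs a larger penalty against an uncertain adversarial prediction than against a confident one.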
arXiv Detail & Related papers (2025-10-03T11:36:02Z)
- Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization [72.30168853571216]
Multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories.
arXiv Detail & Related papers (2025-09-26T04:32:26Z)
- DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation [18.102129546708905]
We present DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework. UnCL enforces global consistency by dynamically weighting the contribution of each voxel to the consistency loss. FeCL enhances local feature discrimination in imbalanced regions by introducing dual focal mechanisms.
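Dynamically weighting each voxel's contribution by its uncertainty can be sketched with an exponential down-weighting, a common choice; the exact weighting form and the `beta` parameter are assumptions of this sketch, not DyCON's published formula:

```python
import math

def uncertainty_weighted_consistency(pred_a, pred_b, uncertainty, beta=1.0):
    """Per-voxel consistency loss where high-uncertainty voxels are
    down-weighted via exp(-beta * u) instead of being discarded."""
    total, norm = 0.0, 0.0
    for a, b, u in zip(pred_a, pred_b, uncertainty):
        w = math.exp(-beta * u)
        total += w * (a - b) ** 2
        norm += w
    return total / norm

# Same disagreement at voxel 0; only the uncertainty assignment differs.
loss_uncertain = uncertainty_weighted_consistency([1.0, 0.0], [0.0, 0.0], [5.0, 0.0])
loss_certain = uncertainty_weighted_consistency([1.0, 0.0], [0.0, 0.0], [0.0, 5.0])
```

With zero uncertainty everywhere the loss reduces to a plain mean squared difference, which is the sanity check one would expect of such a weighting.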
arXiv Detail & Related papers (2025-04-06T17:50:22Z)
- CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models [16.5022773312661]
We propose a universal certified defence framework to safeguard large vision-language models against jailbreak attacks. First, we propose a novel distance metric to quantify semantic discrepancies between malicious and intended responses. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees.
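Randomized smoothing, the building block named above, replaces a base classifier with its majority vote under Gaussian input noise. A minimal Monte-Carlo sketch with a toy classifier (the sampler parameters are illustrative; real certification additionally converts the vote frequency into a certified radius):

```python
import random

def smoothed_predict(classify, x, sigma=0.5, n=200, seed=0):
    """Monte-Carlo randomized smoothing: classify n Gaussian-noised copies
    of x and return the majority class with its empirical frequency."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        c = classify(noisy)
        counts[c] = counts.get(c, 0) + 1
    top = max(counts, key=counts.get)
    return top, counts[top] / n

# Toy base classifier: sign of the first coordinate.
clf = lambda v: int(v[0] > 0)
pred, freq = smoothed_predict(clf, [1.0, 0.0])
```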
arXiv Detail & Related papers (2025-03-08T17:33:55Z)
- CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP [54.660471826755234]
We show that malicious perturbations that seek to maximise the classification loss lead to 'falsely stable' images. We propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time.
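A counterattack is simply gradient descent on the same objective an attacker ascends, applied at inference time to nudge the input back toward a stable prediction. A toy sketch with a numerical gradient and an illustrative quadratic loss (the real method differentiates through CLIP's vision encoder):

```python
def counterattack(x, loss, step=0.1, iters=10, eps=1e-5):
    """Test-time counterattack: descend the loss an attacker would ascend.
    Gradients are estimated numerically, so any scalar loss works here."""
    x = list(x)
    for _ in range(iters):
        grad = []
        for i in range(len(x)):
            xp = list(x); xp[i] += eps
            xm = list(x); xm[i] -= eps
            grad.append((loss(xp) - loss(xm)) / (2 * eps))
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x

# Illustrative loss: squared distance to the clean input [1.0, -1.0];
# the starting point plays the role of an adversarially perturbed input.
loss = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 1.0) ** 2
restored = counterattack([1.3, -0.6], loss)
```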
arXiv Detail & Related papers (2025-03-05T15:51:59Z)
- TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models [53.91006249339802]
We propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks.
TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP.
We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets.
arXiv Detail & Related papers (2024-11-20T08:58:59Z)
- When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.