BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models
- URL: http://arxiv.org/abs/2601.02147v1
- Date: Mon, 05 Jan 2026 14:22:20 GMT
- Title: BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models
- Authors: Sunny Gupta, Shounak Das, Amit Sethi
- Abstract summary: We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space.
- Score: 7.174865411448373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality, either visual or textual, leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly minimize the conditional mutual information between spurious cues and predictions, steering the model toward causal, domain-invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracy over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.
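The abstract does not include code, but the textual-side module is concrete enough to sketch. Below is a minimal, illustrative PyTorch rendering of what "balanced prompt normalization" could look like: a learnable re-centering of CLIP-style class embeddings plus a penalty that pushes their covariance toward the identity, i.e., toward an isotropic semantic space. All names (`BalancedPromptNorm`, `isotropy_penalty`) and the exact parameterization are assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a learnable re-centering of CLIP-style class
# embeddings plus an isotropy penalty. Parameterization is assumed, not
# taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedPromptNorm(nn.Module):
    """Re-center class text embeddings toward an isotropic semantic space."""

    def __init__(self, dim: int):
        super().__init__()
        self.center = nn.Parameter(torch.zeros(dim))      # learnable re-centering
        self.log_scale = nn.Parameter(torch.zeros(dim))   # per-dimension rescaling

    def forward(self, class_emb: torch.Tensor) -> torch.Tensor:
        # class_emb: (num_classes, dim), e.g. CLIP text features per class.
        z = (class_emb - self.center) * self.log_scale.exp()
        return F.normalize(z, dim=-1)                     # back onto the unit sphere

def isotropy_penalty(emb: torch.Tensor) -> torch.Tensor:
    # Push the covariance of the class embeddings toward the identity,
    # i.e. toward an isotropic distribution of class directions.
    z = emb - emb.mean(dim=0, keepdim=True)
    cov = z.T @ z / max(emb.shape[0] - 1, 1)
    eye = torch.eye(cov.shape[0], device=cov.device)
    return ((cov - eye) ** 2).mean()
```

At test time these parameters would presumably be optimized jointly with the visual-side erasure-consistency loss; the abstract does not specify how the two objectives are weighted.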
Related papers
- Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models [10.45965859391796]
Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples. Most prompt-based TTA methods rely on entropy minimization. We propose Fair Context Learning (FCL) that avoids entropy minimization by explicitly addressing shared-evidence bias.
arXiv Detail & Related papers (2026-02-02T16:02:50Z)
- ICON: Invariant Counterfactual Optimization with Neuro-Symbolic Priors for Text-Based Person Search [6.247167721048087]
Text-Based Person Search holds unique value in real-world surveillance, bridging visual perception and language understanding. Current paradigms utilizing pre-trained models often fail to transfer effectively to complex open-world scenarios. This paper proposes ICON, a framework integrating causal and topological priors.
arXiv Detail & Related papers (2026-01-22T13:09:22Z)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
- Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition [36.36218470387896]
We recast this issue as a causal inference problem and ask: would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks.
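Reading the summary literally, one way to synthesize such counterfactual embeddings is to assume CLIP features decompose additively into an object part and a background part, then swap in alternative background expectations. A hedged sketch, where the decomposition and all names are illustrative rather than the paper's exact procedure:

```python
# Hedged sketch: synthesize counterfactual CLIP embeddings by swapping the
# inferred background component. The additive object/background decomposition
# is an assumption for illustration, not the paper's exact method.
import torch
import torch.nn.functional as F

def counterfactual_embeddings(image_emb, background_bank):
    # image_emb: (d,) CLIP image feature; background_bank: (K, d) estimated
    # background expectations (e.g. embeddings of environment descriptions).
    bg_weights = F.softmax(image_emb @ background_bank.T, dim=-1)  # (K,)
    bg_component = bg_weights @ background_bank        # inferred background part
    obj_component = image_emb - bg_component           # residual "object" part
    cf = obj_component.unsqueeze(0) + background_bank  # object in every environment
    return F.normalize(cf, dim=-1)                     # (K, d)

def calibrated_logits(image_emb, class_emb, background_bank, scale=100.0):
    # Average zero-shot logits over counterfactual environments so that no
    # single background dominates the prediction.
    cf = counterfactual_embeddings(image_emb, background_bank)
    return scale * (cf @ class_emb.T).mean(dim=0)      # (num_classes,)
```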
arXiv Detail & Related papers (2025-10-30T13:11:23Z)
- Enhancing CLIP Robustness via Cross-Modality Alignment [54.01929554563447]
We propose Cross-modality Alignment (COLA), an optimal transport-based framework for vision-language models. COLA restores global image-text alignment and local structural consistency in the feature space. COLA is training-free and compatible with existing fine-tuned models.
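The summary does not spell out COLA's transport objective. As a point of reference, the entropic optimal-transport (Sinkhorn) routine that OT-based alignment methods typically build on can be sketched as follows; this is a generic textbook implementation, not COLA's code:

```python
# Generic entropic optimal transport via Sinkhorn iterations, the standard
# building block of OT-based alignment methods; not COLA's actual code.
import torch

def sinkhorn(cost, eps=0.05, iters=50):
    # cost: (n, m) pairwise cost, e.g. 1 - cosine similarity between image
    # patch features and text token features. Uniform marginals assumed.
    n, m = cost.shape
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    r = cost.new_full((n,), 1.0 / n)            # row marginal
    c = cost.new_full((m,), 1.0 / m)            # column marginal
    u, v = r.clone(), c.clone()
    for _ in range(iters):
        u = r / (K @ v)                         # alternate scaling updates
        v = c / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan, shape (n, m)
```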
arXiv Detail & Related papers (2025-10-28T03:47:44Z)
- Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models [0.0]
Large language models internalize a structural trade-off between truthfulness and obsequious flattery. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context.
arXiv Detail & Related papers (2025-10-19T06:36:57Z)
- Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization [72.30168853571216]
Multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning. CapPO (Caption-Regularized Policy Optimization) integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories.
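Mechanism (1) is explicit enough to sketch: a KL term pulling the response distribution conditioned on the raw image toward the one conditioned on its caption. A minimal assumed form (the KL-weighted advantage estimation of mechanism (2) is omitted):

```python
# Assumed form of mechanism (1): KL between next-token distributions for the
# same response conditioned on the raw image vs. on its caption. Mechanism (2)
# is not shown here.
import torch
import torch.nn.functional as F

def caption_consistency_loss(logits_image, logits_caption):
    # logits_*: (batch, vocab) logits for the same target tokens under the
    # two conditionings. Returns KL(P_image || P_caption), averaged per sample.
    log_p = F.log_softmax(logits_image, dim=-1)
    log_q = F.log_softmax(logits_caption, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
```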
arXiv Detail & Related papers (2025-09-26T04:32:26Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval [48.85977777168096]
The Gap-Aware Retrieval (GARE) framework introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness.
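The "lightweight neural module conditioned on the semantic gap" suggests something like the following sketch, where a small MLP maps each pairwise text-video gap to a scalar increment added to the base similarity. The architecture is a guess; GARE's actual module may differ:

```python
# Guessed shape of the lightweight gap module: a small MLP maps each pairwise
# text-video gap to a scalar increment Delta_ij added to the base similarity.
import torch
import torch.nn as nn

class GapIncrement(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, text_emb, video_emb):
        # text_emb: (B, d), video_emb: (B, d). Computing all pairwise gaps
        # couples the increments across the batch, as the summary describes.
        gap = text_emb.unsqueeze(1) - video_emb.unsqueeze(0)  # (B, B, d)
        delta = self.mlp(gap).squeeze(-1)                     # (B, B)
        sim = text_emb @ video_emb.T                          # base similarities
        return sim + delta                                    # corrected scores
```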
arXiv Detail & Related papers (2025-05-18T17:18:06Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
- Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
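For the ISD metric in the entry above, the summary gives only the intent: compare linguistic relations (text self-attention) with visual relations (region self-attention). A toy formulation, assuming a soft text-to-region alignment matrix is available; the names and the exact distance are illustrative, not the paper's definition:

```python
# Toy rendering of the ISD idea: project visual self-attention into the text
# index space via a soft text-to-region alignment, then compare it with the
# linguistic self-attention.
import torch
import torch.nn.functional as F

def intra_modal_self_attention_distance(attn_text, attn_visual, align):
    # attn_text: (T, T) linguistic self-attention; attn_visual: (V, V) visual
    # self-attention; align: (T, V) soft text-to-region alignment (rows sum to 1).
    projected = align @ attn_visual @ align.T   # visual relations in text space
    return F.mse_loss(attn_text, projected)     # lower = more consistent relations
```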