Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
- URL: http://arxiv.org/abs/2601.21078v1
- Date: Wed, 28 Jan 2026 22:03:46 GMT
- Title: Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
- Authors: Jiaqi Li, Guangming Wang, Shuntian Zheng, Minzhe Ni, Xiaoman Lu, Guanghui Ye, Yu Guan,
- Abstract summary: We propose ActionVLM, a vision-language aggregation framework that mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP.
- Score: 20.179748846776995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors and underuse visual evidence, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage, i.e., the incremental benefit of language over vision-only predictions, and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence stemming from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP.
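The two modules described in the abstract lend themselves to a compact illustration. Below is a minimal PyTorch-style sketch of the idea, assuming snippet-level features from a vision backbone and a language encoder; the module names, shapes, and sigmoid gating are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DebiasedResidualAggregation(nn.Module):
    """Sketch of vision-dominant aggregation with a language-advantage gate.

    Layer choices and the gating formulation are assumptions for illustration;
    they are not taken from the ActionVLM paper.
    """

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.vision_head = nn.Linear(dim, num_classes)    # vision-only action scores
        self.language_head = nn.Linear(dim, num_classes)  # language-conditioned scores
        # Gate estimating the per-snippet "language advantage" from both streams.
        self.advantage_gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat, lang_feat: (batch, time, dim) snippet-level features.
        vis_logits = self.vision_head(vis_feat)
        lang_logits = self.language_head(lang_feat)

        # How much is language expected to help beyond vision alone?
        advantage = self.advantage_gate(torch.cat([vis_feat, lang_feat], dim=-1))

        # Residual aggregation: vision remains the primary signal, and the
        # language correction is added only to the extent the gate predicts
        # a benefit, so linguistic priors cannot override visual evidence.
        return vis_logits + advantage * (lang_logits - vis_logits)
```

In this formulation a gate value near zero falls back to vision-only predictions, while a value near one lets language refine ambiguous snippets, which matches the abstract's framing of language as a complementary refinement rather than the primary driver.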
Related papers
- Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs [9.043999205886658]
Hallucinations in large vision-language models often arise when language priors dominate over visual evidence. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths. ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost.
arXiv Detail & Related papers (2026-01-20T08:04:18Z)
- LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction [19.57998167905048]
End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision models to address this limitation. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations.
arXiv Detail & Related papers (2026-01-09T08:06:44Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models [7.174865411448373]
We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space.
arXiv Detail & Related papers (2026-01-05T14:22:20Z)
- Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy [59.44168425139687]
BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. Experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
arXiv Detail & Related papers (2025-12-12T01:59:23Z)
- Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents [39.95793203302782]
We propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraints. AcTOL treats a video as a continuous trajectory, where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames.
arXiv Detail & Related papers (2025-02-03T10:16:49Z)
- VladVA: Discriminative Fine-tuning of LVLMs [67.14293827774827]
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. We propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs.
arXiv Detail & Related papers (2024-12-05T17:54:27Z)
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings. We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)