Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
- URL: http://arxiv.org/abs/2507.06603v2
- Date: Fri, 01 Aug 2025 08:19:46 GMT
- Title: Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
- Authors: Xu Shaowu, Jia Xibin, Gao Junyu, Sun Qianmei, Chang Jing, Fan Chao
- Abstract summary: Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. This paper proposes a structural causal model to uncover causal relationships between videos and label texts.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual causal interventions yield robust action representations that address the challenges of LTAR. Experimental results on three benchmarks, including Charades, Breakfast, and COIN, demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.
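The abstract describes the dual intervention only at a high level, but it can be pictured as two backdoor adjustments applied in sequence: first over the text embeddings, then over the video features guided by the debiased text. The sketch below is a minimal illustration under that reading; backdoor_adjust, the confounder dictionaries, and the uniform priors are assumptions for exposition, not CMDCL's actual components.
```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a CMDCL-style dual causal intervention via
# backdoor adjustment; names and shapes are illustrative, not the paper's.

def backdoor_adjust(feat, confounders, prior):
    """Approximate P(y|do(x)) = sum_z P(y|x,z) P(z) by mixing in a
    confounder dictionary weighted by its prior, via attention."""
    # feat: (B, D), confounders: (K, D), prior: (K,)
    attn = torch.softmax(feat @ confounders.T / feat.size(-1) ** 0.5, dim=-1)
    adjusted = (attn * prior) @ confounders          # (B, D)
    return F.normalize(feat + adjusted, dim=-1)

B, D, K = 4, 512, 32
text_emb  = torch.randn(B, D)   # label-text embeddings from a VLM encoder
video_emb = torch.randn(B, D)   # pooled long-term video features

# Stage 1: textual intervention removes cross-modal bias in label texts.
text_conf = torch.randn(K, D)   # e.g. dataset-level text prototypes (assumed)
debiased_text = backdoor_adjust(text_emb, text_conf, torch.full((K,), 1 / K))

# Stage 2: visual intervention, guided by the debiased text, strips
# visual confounders (backgrounds, co-occurring objects) from the video.
vis_conf = torch.randn(K, D)
guided_video = backdoor_adjust(video_emb + debiased_text, vis_conf,
                               torch.full((K,), 1 / K))

logits = guided_video @ debiased_text.T          # action classification scores
print(logits.shape)                              # torch.Size([4, 4])
```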
Related papers
- Argument-Centric Causal Intervention Method for Mitigating Bias in Cross-Document Event Coreference Resolution [12.185497507437555]
Cross-document Event Coreference Resolution (CD-ECR) seeks to determine whether event mentions across multiple documents refer to the same real-world occurrence. We propose a novel method based on Argument-Centric Causal Intervention (ACCI). ACCI integrates a counterfactual reasoning module that quantifies the causal influence of trigger word perturbations, and an argument-aware enhancement module to promote greater sensitivity to semantically grounded information.
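As a rough illustration of the counterfactual-reasoning idea (not ACCI's actual architecture), one can score a mention pair twice, once with full evidence and once with the trigger branch alone, and subtract the trigger-only prediction so that argument information drives the decision. The scorer, feature split, and zero baseline below are assumptions.
```python
import torch
import torch.nn as nn

# Toy sketch of counterfactual trigger debiasing; scorer, feature split,
# and zero baseline are illustrative assumptions, not ACCI's design.
class PairScorer(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, trig, arg):           # trigger / argument features, (B, dim)
        return torch.sigmoid(self.head(torch.cat([trig, arg], dim=-1)))

scorer = PairScorer(dim=32)
trig, arg = torch.randn(2, 32), torch.randn(2, 32)   # pooled pair features

factual = scorer(trig, arg)                          # total effect
trigger_only = scorer(trig, torch.zeros_like(arg))   # shortcut: trigger match alone
debiased = factual - trigger_only                    # argument-grounded evidence
```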
arXiv Detail & Related papers (2025-06-02T09:46:59Z)
- Deconfounded Reasoning for Multimodal Fake News Detection via Causal Intervention [16.607714608483164]
The rapid growth of social media has led to the widespread dissemination of fake news across multiple content forms. Traditional unimodal detection methods fall short in addressing complex cross-modal manipulations. We propose the Causal Intervention-based Multimodal Deconfounded Detection framework.
arXiv Detail & Related papers (2025-04-12T09:57:43Z)
- Treble Counterfactual VLMs: A Causal Approach to Hallucination [6.3952983618258665]
Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. However, they often generate hallucinated outputs inconsistent with the visual context or prompt. Existing studies link hallucination to statistical biases, language priors, and biased feature learning, but lack a structured causal understanding.
arXiv Detail & Related papers (2025-03-08T11:13:05Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
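IMCCD's specific mechanism calibrates cross-modal attention, but the training-free contrastive-decoding family it belongs to is easy to sketch: contrast next-token logits from the full input against logits from an input whose visual evidence has been weakened, so tokens driven by language priors alone are penalized. The ablation strategy and the alpha/beta hyperparameters below are generic assumptions, not the paper's exact procedure.
```python
import torch

# Generic visual contrastive decoding sketch (training-free); the ablation
# and hyperparameters are assumptions, not IMCCD's exact method.
def contrastive_decode(logits_full, logits_ablated, alpha=1.0, beta=0.1):
    # logits_*: (V,) next-token logits with full vs. visually-weakened input
    probs_full = logits_full.softmax(-1)
    keep = probs_full >= beta * probs_full.max()        # plausibility filter
    scores = (1 + alpha) * logits_full - alpha * logits_ablated
    scores[~keep] = float('-inf')                       # drop implausible tokens
    return scores.argmax()                              # greedy next token

vocab = 32000
next_tok = contrastive_decode(torch.randn(vocab), torch.randn(vocab))
```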
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: a Multi-granularity Contrastive cross-modal collaborative Generation model.
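To make the multi-granularity contrastive idea concrete, here is a hypothetical sketch that aligns a question embedding with both video-level and clip-level features via InfoNCE; the choice of granularities, the mean pooling, and the temperature are assumptions rather than the model's actual design.
```python
import torch
import torch.nn.functional as F

# Hypothetical multi-granularity contrastive objective; granularities,
# pooling, and temperature are assumptions, not the paper's design.
def info_nce(q, k, tau=0.07):
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.T / tau                    # (B, B) similarity matrix
    targets = torch.arange(q.size(0))         # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)

B, D = 8, 256
question = torch.randn(B, D)
video_global = torch.randn(B, D)              # whole-video representation
clip_local = torch.randn(B, 4, D).mean(dim=1) # pooled clip-level features

loss = info_nce(question, video_global) + info_nce(question, clip_local)
```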
arXiv Detail & Related papers (2024-10-12T06:21:58Z)
- Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL [57.202733701029594]
We propose Decision Mamba, a novel multi-grained state space model (SSM) with a self-evolving policy learning strategy. To mitigate the overfitting issue on noisy trajectories, the self-evolving policy is trained with progressive regularization.
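One plausible reading of a self-evolving policy with progressive regularization is a loss that gradually shifts trust from the (noisy) dataset actions toward the policy's own earlier predictions; the linear schedule and MSE form below are assumptions, not the paper's formulation.
```python
import torch
import torch.nn.functional as F

# Loose sketch of progressive self-regularization for offline policy
# learning; schedule and loss form are assumptions, not Decision Mamba's.
def policy_loss(pred_actions, target_actions, old_pred_actions, step, total_steps):
    bc = F.mse_loss(pred_actions, target_actions)        # fit (noisy) dataset
    reg = F.mse_loss(pred_actions, old_pred_actions)     # trust earlier self
    lam = step / total_steps                             # progressive weight
    return (1 - lam) * bc + lam * reg

pred = torch.randn(16, 6, requires_grad=True)            # batch of actions
loss = policy_loss(pred, torch.randn(16, 6), torch.randn(16, 6),
                   step=100, total_steps=1000)
loss.backward()
```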
arXiv Detail & Related papers (2024-06-08T10:12:00Z)
- Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts".
arXiv Detail & Related papers (2024-03-01T09:01:53Z)
- Improving Vision Anomaly Detection with the Guidance of Language Modality [64.53005837237754]
This paper tackles the challenges of the vision modality from a multimodal point of view.
We propose Cross-modal Guidance (CMG) to tackle the redundant-information issue and the sparse-space issue.
To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality.
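A rough sketch of the correlation-structure idea: compute pairwise correlations among language embeddings and pull the vision latent space's correlation matrix toward them. The Gram-matrix alignment loss below is an assumption for illustration, not CMLE's exact objective.
```python
import torch
import torch.nn.functional as F

# Sketch of distilling a language-derived correlation structure into the
# vision latent space; the alignment loss is an assumption, not CMLE's.
def gram(x):                                  # pairwise cosine correlations
    x = F.normalize(x, dim=-1)
    return x @ x.T

text_emb   = torch.randn(16, 512)             # language-side embeddings
vision_lat = torch.randn(16, 128, requires_grad=True)  # vision latents to shape

struct_loss = F.mse_loss(gram(vision_lat), gram(text_emb).detach())
struct_loss.backward()  # pulls vision-space correlations toward language ones
```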
arXiv Detail & Related papers (2023-10-04T13:44:56Z)
- Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z)
- Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [134.91774666260338]
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes.
We propose a framework for cross-modal causal relational reasoning to address the task of event-level visual question answering.
arXiv Detail & Related papers (2022-07-26T04:25:54Z)