Towards Multimodal Human Intention Understanding Debiasing via
Subject-Deconfounding
- URL: http://arxiv.org/abs/2403.05025v1
- Date: Fri, 8 Mar 2024 04:03:54 GMT
- Title: Towards Multimodal Human Intention Understanding Debiasing via
Subject-Deconfounding
- Authors: Dingkang Yang, Dongling Xiao, Ke Li, Yuzheng Wang, Zhaoyu Chen, Jinjie
Wei, Lihua Zhang
- Abstract summary: We propose SuCI, a causal intervention module to disentangle the impact of subjects acting as unobserved confounders.
As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions.
- Score: 15.525357031558753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal intention understanding (MIU) is an indispensable component of
human expression analysis (e.g., sentiment or humor) from heterogeneous
modalities, including visual postures, linguistic contents, and acoustic
behaviors. Existing works invariably focus on designing sophisticated
structures or fusion strategies to achieve impressive improvements.
Unfortunately, they all suffer from the subject variation problem due to data
distribution discrepancies among subjects. Concretely, MIU models are easily
misled by distinct subjects with different expression customs and
characteristics in the training data to learn subject-specific spurious
correlations, significantly limiting performance and generalizability across
uninitiated subjects. Motivated by this observation, we introduce a
recapitulative causal graph to formulate the MIU procedure and analyze the
confounding effect of subjects. Then, we propose SuCI, a simple yet effective
causal intervention module to disentangle the impact of subjects acting as
unobserved confounders and achieve model training via true causal effects. As a
plug-and-play component, SuCI can be widely applied to most methods that seek
unbiased predictions. Comprehensive experiments on several MIU benchmarks
clearly demonstrate the effectiveness of the proposed module.
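Note that the abstract does not detail how the intervention itself is computed. As a minimal sketch (not necessarily the paper's exact formulation), causal intervention modules of this kind typically approximate a backdoor adjustment over the subject variable S, treated as the confounder between the multimodal input X and the prediction Y:

  P(Y | do(X)) = \sum_{s} P(Y | X, S = s) P(S = s)

Compared with the biased likelihood P(Y | X) = \sum_{s} P(Y | X, S = s) P(S = s | X), each subject now contributes according to its prior P(S = s), so subject-specific expression habits no longer modulate the prediction; the exact form used by SuCI is specified in the paper.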
Related papers
- A Debate-Driven Experiment on LLM Hallucinations and Accuracy [7.821303946741665]
This study investigates the phenomenon of hallucination in large language models (LLMs).
Multiple instances of GPT-4o-Mini models engage in a debate-like interaction prompted with questions from the TruthfulQA dataset.
One model is deliberately instructed to generate plausible but false answers while the other models are asked to respond truthfully.
arXiv Detail & Related papers (2024-10-25T11:41:27Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training [14.450673163785094]
Context-Aware Emotion Recognition (CAER) provides valuable semantic cues for recognizing the emotions of target persons.
Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts.
We present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder.
arXiv Detail & Related papers (2024-07-06T05:29:02Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Towards Multimodal Sentiment Analysis Debiasing via Bias Purification [21.170000473208372]
Multimodal Sentiment Analysis (MSA) aims to understand human intentions by integrating emotion-related clues from diverse modalities.
The MSA task invariably suffers from unplanned dataset biases, particularly multimodal utterance-level label bias and word-level context bias.
We present a Multimodal Counterfactual Inference Sentiment analysis framework based on causality rather than conventional likelihood.
arXiv Detail & Related papers (2024-03-08T03:55:27Z)
- Triple Disentangled Representation Learning for Multimodal Affective Analysis [20.37986194570143]
Multimodal learning has exhibited a significant advantage in affective analysis tasks.
Many emerging studies focus on disentangling the modality-invariant and modality-specific representations from input data and then fusing them for prediction.
We propose a novel triple disentanglement approach, TriDiRA, which disentangles the modality-invariant, effective modality-specific and ineffective modality-specific representations from input data.
arXiv Detail & Related papers (2024-01-29T12:45:27Z)
- Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models [53.337728969143086]
Recommendation systems harness user-item interactions like clicks and reviews to learn their representations.
Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents.
We introduce a chain-based prompting approach to uncover semantic aspect-aware interactions.
arXiv Detail & Related papers (2023-12-26T15:44:09Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual, and audio-visual event instances and to identify the corresponding event categories, using only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
- Context De-confounded Emotion Recognition [12.037240778629346]
Context-Aware Emotion Recognition (CAER) aims to perceive the emotional states of the target person with contextual information.
A long-overlooked issue is that a context bias in existing datasets leads to a significantly unbalanced distribution of emotional states.
This paper provides a causality-based perspective to disentangle the models from the impact of such bias and formulates the causalities among variables in the CAER task.
arXiv Detail & Related papers (2023-03-21T15:12:20Z)
- Multilingual Multi-Aspect Explainability Analyses on Machine Reading Comprehension Models [76.48370548802464]
This paper focuses on conducting a series of analytical experiments to examine the relations between the multi-head self-attention and the final MRC system performance.
We discover that passage-to-question and passage understanding attentions are the most important ones in the question answering process.
Through comprehensive visualizations and case studies, we also observe several general findings on the attention maps, which can be helpful to understand how these models solve the questions.
arXiv Detail & Related papers (2021-08-26T04:23:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.