Towards Multimodal Human Intention Understanding Debiasing via
Subject-Deconfounding
- URL: http://arxiv.org/abs/2403.05025v1
- Date: Fri, 8 Mar 2024 04:03:54 GMT
- Title: Towards Multimodal Human Intention Understanding Debiasing via
Subject-Deconfounding
- Authors: Dingkang Yang, Dongling Xiao, Ke Li, Yuzheng Wang, Zhaoyu Chen, Jinjie
Wei, Lihua Zhang
- Abstract summary: We propose SuCI, a causal intervention module to disentangle the impact of subjects acting as unobserved confounders.
As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions.
- Score: 15.525357031558753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal intention understanding (MIU) is an indispensable component of
human expression analysis (e.g., sentiment or humor) from heterogeneous
modalities, including visual postures, linguistic contents, and acoustic
behaviors. Existing works invariably focus on designing sophisticated
structures or fusion strategies to achieve impressive improvements.
Unfortunately, they all suffer from the subject variation problem due to data
distribution discrepancies among subjects. Concretely, MIU models are easily
misled by distinct subjects with different expression customs and
characteristics in the training data to learn subject-specific spurious
correlations, significantly limiting performance and generalizability across
uninitiated subjects. Motivated by this observation, we introduce a
recapitulative causal graph to formulate the MIU procedure and analyze the
confounding effect of subjects. Then, we propose SuCI, a simple yet effective
causal intervention module to disentangle the impact of subjects acting as
unobserved confounders and achieve model training via true causal effects. As a
plug-and-play component, SuCI can be widely applied to most methods that seek
unbiased predictions. Comprehensive experiments on several MIU benchmarks
clearly demonstrate the effectiveness of the proposed module.
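The abstract gives no implementation details, but the described intervention reads like a backdoor adjustment, P(Y | do(X)) = sum_s P(Y | X, s) P(s), in which the unobserved subject confounder is approximated by a dictionary of per-subject prototype features that is marginalized out at prediction time. Below is a minimal, hypothetical PyTorch-style sketch under that assumption; the class name, the attention-based aggregation, and the use of per-subject feature means as prototypes are illustrative choices, not the authors' released design.

```python
import torch
import torch.nn as nn


class SubjectDeconfounder(nn.Module):
    """Sketch of a plug-and-play causal-intervention layer (assumed design).

    Assumption: the subject confounder is approximated by a fixed dictionary
    of per-subject prototype features (e.g., each training subject's mean
    fused feature), and the backdoor adjustment E_s[P(y | x, s)] is
    approximated by attending over that dictionary and adding the
    prior-weighted confounder back to the fused feature.
    """

    def __init__(self, feat_dim: int, subject_prototypes: torch.Tensor,
                 subject_priors: torch.Tensor):
        super().__init__()
        # (num_subjects, feat_dim): one prototype per training subject
        self.register_buffer("prototypes", subject_prototypes)
        # (num_subjects,): prior P(s), e.g. each subject's sample frequency
        self.register_buffer("priors", subject_priors)
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, feat_dim) multimodal feature from any MIU backbone
        q = self.query(fused)                                   # (B, D)
        k = self.key(self.prototypes)                           # (S, D)
        attn = torch.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)  # (B, S)
        # weight each prototype by its attention score and its prior P(s)
        confounder = (attn * self.priors) @ self.prototypes     # (B, D)
        # intervened feature passed to the downstream classifier
        return fused + confounder
```

In use, such a layer would sit between an existing fusion backbone and its classifier head, so predictions are made from the intervened feature rather than the raw fused one; this is one plausible reading of "plug-and-play", not a description of the paper's actual code.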
Related papers
- Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective [23.49276487518479]
We explore the influence of three key factors separately by transitioning the modality from text to speech in an evolving manner.
Factor A has a relatively minor impact, factor B more noticeably influences syntactic and semantic modeling, and factor C exerts the most significant impact, particularly on basic lexical modeling.
arXiv Detail & Related papers (2024-12-22T14:59:19Z) - A Debate-Driven Experiment on LLM Hallucinations and Accuracy [7.821303946741665]
This study investigates the phenomenon of hallucination in large language models (LLMs).
Multiple instances of GPT-4o-Mini models engage in a debate-like interaction prompted with questions from the TruthfulQA dataset.
One model is deliberately instructed to generate plausible but false answers while the other models are asked to respond truthfully.
arXiv Detail & Related papers (2024-10-25T11:41:27Z) - The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio [118.75449542080746]
This paper presents the first systematic investigation of hallucinations in large multimodal models (LMMs).
Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations.
Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning.
arXiv Detail & Related papers (2024-10-16T17:59:02Z) - On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z) - Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks [0.0]
This paper examines how the sequencing of images and text within multi-modal prompts influences the reasoning performance of large language models (LLMs).
For simpler tasks involving a single image, modality sequencing had a clear impact on accuracy.
In more complex tasks involving multiple images and intricate reasoning steps, the effect of sequencing diminished, likely due to the increased cognitive demands of the task.
arXiv Detail & Related papers (2024-10-04T00:55:15Z) - Most Influential Subset Selection: Challenges, Promises, and Beyond [9.479235005673683]
We study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence.
We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses.
We demonstrate that an adaptive version of these approaches, which applies them iteratively, can effectively capture the interactions among samples.
arXiv Detail & Related papers (2024-09-25T20:00:23Z) - Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training [14.450673163785094]
Context-Aware Emotion Recognition (CAER) provides valuable semantic cues for recognizing the emotions of target persons.
Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts.
We present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder.
arXiv Detail & Related papers (2024-07-06T05:29:02Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models [53.337728969143086]
Recommendation systems harness user-item interactions like clicks and reviews to learn their representations.
Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents.
We introduce a chain-based prompting approach to uncover semantic aspect-aware interactions.
arXiv Detail & Related papers (2023-12-26T15:44:09Z) - CausalDialogue: Modeling Utterance-level Causality in Conversations [83.03604651485327]
We have compiled and expanded upon a new dataset called CausalDialogue through crowd-sourcing.
This dataset includes multiple cause-effect pairs within a directed acyclic graph (DAG) structure.
We propose a causality-enhanced method called Exponential Average Treatment Effect (ExMATE) to enhance the impact of causality at the utterance level in training neural conversation models.
arXiv Detail & Related papers (2022-12-20T18:31:50Z) - Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment
Analysis [56.84237932819403]
This paper aims to estimate and mitigate the adverse effect of the textual modality on strong OOD generalization.
Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis.
arXiv Detail & Related papers (2022-07-24T03:57:40Z)