Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
- URL: http://arxiv.org/abs/2508.16122v3
- Date: Tue, 21 Oct 2025 14:28:06 GMT
- Title: Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
- Authors: Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal
- Abstract summary: This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, on the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0.
- Score: 12.754751703604734
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, on the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0. This advantage stems from a strong textual bias in these datasets: over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We further confirm this modality bias through human evaluation. We then propose a framework to debias the datasets; after debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 are removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models affected the most (accuracy drops of 50-60%). Finally, we empirically analyze the context-specific relevance of different modalities. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
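The abstract describes the debiasing framework only at a high level: samples whose intent label is recoverable from text alone are treated as text-biased and removed. A minimal sketch of that filtering idea is shown below; `predict_from_text`, the sample layout, and the toy data are illustrative assumptions (e.g., a text-only LLM such as Mistral-7B wrapped as a classifier), not the authors' actual implementation.

```python
# A minimal sketch of the text-sufficiency filter suggested by the abstract.
# `predict_from_text`, the sample layout, and the toy data are illustrative
# assumptions, not the paper's actual debiasing implementation.

from typing import Callable, Dict, List


def debias_by_text_sufficiency(
    samples: List[Dict],                      # each: {"text": str, "label": str, ...}
    predict_from_text: Callable[[str], str],  # e.g. a text-only LLM classifier
) -> List[Dict]:
    """Keep only samples whose intent is NOT recoverable from text alone."""
    kept = []
    for sample in samples:
        if predict_from_text(sample["text"]) != sample["label"]:
            # Text alone fails here, so the sample plausibly needs
            # audio/visual context and stays in the debiased set.
            kept.append(sample)
    return kept


if __name__ == "__main__":
    toy = [
        {"text": "Could you help me move this box?", "label": "ask for help"},
        {"text": "Hmm.", "label": "complain"},  # likely needs tone or expression
    ]
    # Trivial stand-in predictor for illustration only.
    naive = lambda t: "ask for help" if "help" in t.lower() else "inform"
    print(debias_by_text_sufficiency(toy, naive))  # -> only the second sample
```

Under this reading, such a filter would discard roughly the 70% of MIntRec-1 samples that a text-only model already handles, which is consistent with the sharp performance drops the abstract reports on the remaining subset.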
Related papers
- Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models [63.032359320629105]
We introduce Unpaired Multimodal, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. We show that using unpaired data from auxiliary modalities consistently improves downstream performance across diverse unimodal targets such as images and audio.
arXiv Detail & Related papers (2025-10-09T17:32:23Z) - MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning [4.963955559863751]
MMAT-1M is the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. The dataset is constructed through a novel four-stage data engine. By fine-tuning open-source multimodal models on MMAT-1M, we observe significant performance gains.
arXiv Detail & Related papers (2025-07-29T15:39:14Z) - Does Multimodality Lead to Better Time Series Forecasting? [84.74978289870155]
It remains unclear whether and under what conditions such multimodal integration consistently yields gains. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Our findings highlight that, on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies.
arXiv Detail & Related papers (2025-06-20T23:55:56Z) - Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning [71.3533541927459]
We propose a novel data selection paradigm termed Reasoning Activation Potential (RAP). RAP identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning. Our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
arXiv Detail & Related papers (2025-06-05T08:40:24Z) - Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models [0.0]
This project performs multimodal sentiment analysis using the CMU-MOSEI dataset. We use transformer-based models with early fusion to integrate text, audio, and visual modalities. The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set.
arXiv Detail & Related papers (2025-05-09T15:10:57Z) - Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [50.492124556982674]
This paper introduces a novel choice-based sample selection framework. It shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. We validate our approach on a larger medical dataset, highlighting its applicability in real-world settings.
arXiv Detail & Related papers (2025-03-04T07:32:41Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluate 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conduct human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieves better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - Debiasing Multimodal Models via Causal Information Minimization [65.23982806840182]
We study bias arising from confounders in a causal graph for multimodal data.
Robust predictive features contain diverse information that helps a model generalize to out-of-distribution data.
We use these features as confounder representations and use them via methods motivated by causal theory to remove bias from models.
arXiv Detail & Related papers (2023-11-28T16:46:14Z) - Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, as well as with auditory speaker identification.
arXiv Detail & Related papers (2023-07-06T08:02:45Z) - Defending Multimodal Fusion Models against Single-Source Adversaries [6.019777076722421]
We show that standard multimodal fusion models are vulnerable to single-source adversaries.
An attack on any single modality can overcome the correct information from multiple unperturbed modalities and cause the model to fail.
Motivated by this finding, we propose an adversarially robust fusion strategy.
arXiv Detail & Related papers (2022-06-25T18:57:02Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.