Related papers: MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

URL: http://arxiv.org/abs/2406.17126v1
Date: Mon, 24 Jun 2024 20:29:16 GMT
Title: MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs
Authors: Wenqian Ye, Guangtao Zheng, Yunsheng Ma, Xu Cao, Bolin Lai, James M. Rehg, Aidong Zhang,
Abstract summary: Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe pitfall in deep learning models trained on single modality data. We introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations. Our findings illuminate the persistence of the reliance on spurious correlations from these models and underscore the urge for new methodologies to mitigate spurious biases.
Score: 38.93090238335506
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe robustness pitfall in deep learning models trained on single modality data. Multimodal Large Language Models (MLLMs), which integrate both vision and language models, have demonstrated strong capability in joint vision-language understanding. However, whether spurious biases are prevalent in MLLMs remains under-explored. We mitigate this gap by analyzing the spurious biases in a multimodal setting, uncovering the specific test data patterns that can manifest this problem when biases in the vision model cascade into the alignment between visual and text tokens in MLLMs. To better understand this problem, we introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations from five open-source image datasets. The VQA dataset is built from human-understandable concept information (attributes). Leveraging this benchmark, we conduct a thorough evaluation of current state-of-the-art MLLMs. Our findings illuminate the persistence of the reliance on spurious correlations from these models and underscore the urge for new methodologies to mitigate spurious biases. To support the MLLM robustness research, we release our VQA benchmark at https://huggingface.co/datasets/mmbench/MM-SpuBench.

Related papers

Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation [9.434966074326056]
Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination risk masking true generalization.<n>We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks.<n>We demonstrate that fine-tuning on simulated test data (extreme contamination) drastically sharpens task-specific performance but harms overall generalization.
arXiv Detail & Related papers (2025-06-08T15:52:38Z)
STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset [13.574832958298911]
STORM is a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating.<n>We propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts.<n>This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels.
arXiv Detail & Related papers (2025-06-02T14:48:15Z)
Can Multimodal Large Language Models Understand Spatial Relations? [16.76001474065412]
We introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO 2017.<n>Results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%.
arXiv Detail & Related papers (2025-05-25T07:37:34Z)
MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images.<n>MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs.<n>This paper argues that MLLMs are deeply affected by modality bias, highlighting its manifestations across various tasks.
arXiv Detail & Related papers (2025-05-24T11:49:31Z)
Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT) GCoT is keen to helping MLLMs to recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z)
Seeing What's Not There: Spurious Correlation in Multimodal LLMs [47.651861502104715]
We introduce SpurLens, a pipeline that automatically identifies spurious visual cues without human supervision. Our findings reveal that spurious correlations cause two major failure modes in Multimodal Large Language Models (MLLMs) By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.
arXiv Detail & Related papers (2025-03-11T20:53:00Z)
Protecting multimodal large language models against misleading visualizations [94.71976205962527]
We introduce the first inference-time methods to improve performance on misleading visualizations. We find that MLLM question-answering accuracy drops on average to the level of a random baseline.
arXiv Detail & Related papers (2025-02-27T20:22:34Z)
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. We propose the Multimodal Inconsistency Reasoning benchmark to assess MLLMs' ability to detect and reason about semantic mismatches. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z)
Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) [26.475993408532304]
We study the ability of an MLLM model to produce semantically similar or identical responses to semantically similar queries. We propose the MM-R$3$ benchmark, which analyses the performance in terms of consistency and accuracy in SoTA MLLMs. Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.
arXiv Detail & Related papers (2024-10-07T06:36:55Z)
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [12.841405829775852]
We introduce the modality importance score (MIS) to identify bias inVidQA benchmarks and datasets. We also propose an innovative method using state-of-the-art MLLMs to estimate the modality importance. Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets.
arXiv Detail & Related papers (2024-08-22T23:32:42Z)
Social Debiasing for Fair Multi-modal LLMs [55.8071045346024]
Multi-modal Large Language Models (MLLMs) have advanced significantly, offering powerful vision-language understanding capabilities. However, these models often inherit severe social biases from their training datasets, leading to unfair predictions based on attributes like race and gender. This paper addresses the issue of social biases in MLLMs by i) Introducing a comprehensive Counterfactual dataset with Multiple Social Concepts (CMSC) and ii) Proposing an Anti-Stereotype Debiasing strategy (ASD)
arXiv Detail & Related papers (2024-08-13T02:08:32Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
MMRel: A Relation Understanding Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a benchmark that features large-scale, high-quality, and diverse data on inter-object relations. MMRel is ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance relation comprehension capability.
arXiv Detail & Related papers (2024-06-13T13:51:59Z)
CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques. We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z)
MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks. MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models. Specifically, we construct a input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z)
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations. We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.