Measuring Epistemic Humility in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2509.09658v1
- Date: Thu, 11 Sep 2025 17:54:00 GMT
- Title: Measuring Epistemic Humility in Multimodal Large Language Models
- Authors: Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
- Abstract summary: We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers. We leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions. HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings.
- Score: 17.490955813494693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
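The evaluation protocol described above is concrete enough to sketch. Below is a minimal Python illustration of scoring with an explicit "None of the above" option; the question text, options, and the `predict` callable are hypothetical stand-ins, and the actual HumbleBench prompt templates and GPT-4-Turbo generation step are not reproduced.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: dict   # e.g. {"A": "riding", ..., "E": "None of the above"}
    answer: str     # gold option letter; "E" when no listed option is valid

def score(questions, predict):
    """Overall accuracy plus accuracy on the 'None of the above' subset,
    which isolates false-option rejection (epistemic humility)."""
    correct = nota_correct = nota_total = 0
    for q in questions:
        pred = predict(q)                  # model returns an option letter
        correct += pred == q.answer
        if q.answer == "E":                # NOTA items
            nota_total += 1
            nota_correct += pred == "E"
    return correct / len(questions), nota_correct / max(nota_total, 1)

# hypothetical item: every listed option contradicts the image,
# so the gold answer is the explicit rejection option "E"
item = MCQ(
    question="What relation holds between the person and the horse?",
    options={"A": "riding", "B": "feeding", "C": "washing",
             "D": "chasing", "E": "None of the above"},
    answer="E",
)
acc, nota_acc = score([item], predict=lambda q: "E")
print(acc, nota_acc)  # 1.0 1.0
```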
Related papers
- MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models [29.830224745428566]
We present MMErroR, a benchmark of 2,013 samples, each embedding a single coherent reasoning error. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation. We evaluate 20 advanced Vision-Language Models; even the best model (Gemini-3.0-Pro) classifies the error in only 66.47% of cases.
arXiv Detail & Related papers (2026-01-06T17:45:26Z)
- Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding [41.828387997311474]
Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. They remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. We propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference.
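The training-free ensemble idea lends itself to a token-level sketch. The snippet below mixes next-token distributions from several models; ATED's adaptive per-step weights are not reproduced, so a uniform average stands in for them (an assumption).

```python
import numpy as np

def ensemble_next_token(token_logprobs, weights=None):
    """Combine next-token distributions from several (LV)LMs.

    token_logprobs: list of arrays, one per model, each of shape [vocab]
    weights: optional per-model weights; ATED adapts these per step,
             here we default to a uniform average (assumption).
    """
    probs = np.stack([np.exp(lp) for lp in token_logprobs])  # [models, vocab]
    w = np.ones(len(probs)) / len(probs) if weights is None else np.asarray(weights)
    mixed = (w[:, None] * probs).sum(axis=0)
    return int(mixed.argmax())                               # greedy pick

# toy example: two "models" over a 4-token vocabulary
m1 = np.log(np.array([0.70, 0.10, 0.10, 0.10]))  # hallucination-prone
m2 = np.log(np.array([0.05, 0.80, 0.10, 0.05]))  # grounded model
print(ensemble_next_token([m1, m2]))              # token 1 wins after mixing
```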
arXiv Detail & Related papers (2025-10-21T06:11:24Z)
- LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization [9.410181019585822]
We operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context.
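A minimal sketch of the internals-based predictor, assuming per-example hidden-state features and correctness labels are already extracted; the paper's specific metrics, probe layers, and context conditions are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# assumed inputs: one hidden-state vector per answer (e.g. last-token,
# mid-layer activations) and a 0/1 label for answer correctness
X = rng.normal(size=(200, 64))      # stand-in for extracted activations
y = (X[:, 0] > 0).astype(int)       # stand-in for correctness labels

probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("held-out accuracy:", probe.score(X[150:], y[150:]))
```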
arXiv Detail & Related papers (2025-10-05T03:14:05Z)
- Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection [71.8243083897721]
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. We present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training.
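The preference-pair construction can be sketched as a consistency check between a long response and a short answer to the same question; the `consistent` judge and the chosen/rejected assignment below are assumptions, not the paper's exact recipe.

```python
def build_preference_pairs(samples, consistent):
    """samples: (question, long_response, short_answer) triples.
    consistent: callable judging whether the long response agrees with the
    short answer (the paper's actual check is not reproduced here).
    Returns (chosen, rejected) pairs usable for preference training."""
    pairs = []
    for question, long_resp, short_ans in samples:
        if consistent(long_resp, short_ans):
            continue                 # agreement: nothing to correct
        # disagreement signals a likely hallucination in the long response;
        # prefer an answer grounded in the short, self-consistent reply
        chosen = f"{question} -> {short_ans}"
        rejected = f"{question} -> {long_resp}"
        pairs.append((chosen, rejected))
    return pairs

demo = [("How many dogs are in the image?",
         "There are three dogs playing near a red ball.", "two")]
print(build_preference_pairs(demo, consistent=lambda l, s: s in l))
```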
arXiv Detail & Related papers (2025-09-27T10:37:11Z)
- Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability [10.607081850023286]
We introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Most models struggle to actively detect flawed textual premises without guidance. These insights underscore the urgent need to enhance LMMs' proactive verification of input validity.
arXiv Detail & Related papers (2025-08-06T02:13:46Z)
- Token Level Hallucination Detection via Variance in Language Models [0.0]
Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple generations. Our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis.
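The core signal is simple to sketch: sample several generations, align token positions, and flag positions whose log-probability varies widely across samples. The alignment and the threshold below are assumptions.

```python
import numpy as np

def flag_unstable_tokens(logprob_runs, threshold=0.5):
    """logprob_runs: array [n_generations, n_tokens] of per-token
    log-probabilities for the same (aligned) token positions.
    Returns indices whose variance across generations exceeds the
    threshold -- candidate hallucinated tokens -- plus the variances."""
    variance = np.var(np.asarray(logprob_runs), axis=0)
    return np.flatnonzero(variance > threshold), variance

runs = [[-0.1, -0.2, -3.5],   # three sampled generations,
        [-0.1, -0.3, -0.4],   # three aligned token positions
        [-0.2, -0.2, -4.1]]
idx, var = flag_unstable_tokens(runs)
print(idx, var.round(2))       # only position 2 is flagged
```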
arXiv Detail & Related papers (2025-07-05T19:20:59Z)
- Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation [9.434966074326056]
Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet there are growing concerns that data contamination masks true generalization. We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks. We demonstrate that fine-tuning on simulated test data (extreme contamination) drastically sharpens task-specific performance but harms overall generalization.
arXiv Detail & Related papers (2025-06-08T15:52:38Z)
- BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models [2.526146573337397]
We propose a new evaluation methodology, inspired by ophthalmologic diagnostics. We use procedural generation of synthetic images to obtain control over visual attributes. This diagnostic allows systematic stress testing and fine-grained failure analysis.
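Procedural control over visual attributes can be illustrated in a few lines; the benchmark's actual generators and diagnostic protocol are not reproduced. A hypothetical counting probe where the ground truth is exact by construction:

```python
import random
from PIL import Image, ImageDraw

def make_counting_image(n_circles, color="red", size=256, seed=0):
    """Render n_circles circles of a known color at non-overlapping grid
    positions, so the answer to a counting question is exact by design."""
    random.seed(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    cells = [(x, y) for x in range(30, size, 60) for y in range(30, size, 60)]
    for x, y in random.sample(cells, n_circles):
        draw.ellipse([x - 12, y - 12, x + 12, y + 12], fill=color)
    return img

img = make_counting_image(5)   # ask an MLLM: "How many red circles?"
img.save("count_5_red.png")    # gold answer is 5 by construction
```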
arXiv Detail & Related papers (2025-06-05T12:43:10Z)
- SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
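The black-box scoring idea can be sketched as prompting per candidate label and fusing the resulting scores; SPARC's actual prompts and adaptive fusion are not reproduced, so a plain mean over assumed prompt phrasings stands in.

```python
import numpy as np

def fuse_label_scores(score_fn, labels, prompts):
    """score_fn(prompt) -> score in [0, 1] from a black-box VLM
    (e.g. the probability it assigns to 'yes'). Each label is queried
    with several prompt phrasings; SPARC's adaptive fusion is replaced
    here by a plain mean (assumption)."""
    fused = {}
    for label in labels:
        scores = [score_fn(p.format(label=label)) for p in prompts]
        fused[label] = float(np.mean(scores))
    return fused

prompts = ["Is there a {label} in the image?",
           "Does this photo contain a {label}?"]
fake_vlm = lambda p: 0.9 if "dog" in p else 0.2   # stand-in for a real VLM
print(fuse_label_scores(fake_vlm, ["dog", "cat"], prompts))
```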
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
- Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
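A minimal sketch of the pipeline: elicit follow-up answers, treat their probabilities as a feature vector, and fit a linear predictor of correctness. The follow-up prompts and the stand-in data below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [                        # hypothetical follow-up prompts
    "Are you confident in your answer?",
    "Could your answer be wrong?",
    "Does your answer directly address the question?",
]

def features(ask_prob):
    """ask_prob(prompt) -> P('yes') from the black-box model; the vector
    of probabilities across follow-ups represents one instance."""
    return np.array([ask_prob(q) for q in FOLLOW_UPS])

# stand-in training data: in practice each row would be features(...)
# computed for one (question, answer) instance, labeled by correctness
rng = np.random.default_rng(1)
X = rng.uniform(size=(100, len(FOLLOW_UPS)))
y = (X[:, 0] > X[:, 1]).astype(int)   # toy labels for the demo
clf = LogisticRegression(max_iter=1000).fit(X[:80], y[:80])
print("held-out accuracy:", clf.score(X[80:], y[80:]))
```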
arXiv Detail & Related papers (2025-01-02T22:26:54Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and effective at triggering hallucinations in large language models.
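The prompt-chaining perturbation can be sketched against an assumed `call_llm` interface (prompt in, completion out); the actual ReEval prompts and ChatGPT plumbing are not reproduced.

```python
def perturb_evidence(call_llm, question, evidence):
    """Two-step prompt chain, loosely following the ReEval recipe:
    first rewrite the evidence so it no longer supports the gold answer,
    then verify the rewrite is fluent and self-consistent. `call_llm` is
    an assumed callable: prompt string in, completion string out."""
    rewritten = call_llm(
        f"Rewrite the passage so it gives a DIFFERENT answer to the "
        f"question, changing as little text as possible.\n"
        f"Question: {question}\nPassage: {evidence}"
    )
    verdict = call_llm(
        f"Is this passage fluent and internally consistent? "
        f"Answer yes or no.\nPassage: {rewritten}"
    )
    return rewritten if verdict.strip().lower().startswith("yes") else None

# usage with a stand-in model (replace with a real LLM call):
fake_llm = lambda p: "yes" if p.startswith("Is this") else "Edited passage."
print(perturb_evidence(fake_llm, "Who wrote Hamlet?", "Hamlet was written by..."))
```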
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.