HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models
- URL: http://arxiv.org/abs/2511.12693v1
- Date: Sun, 16 Nov 2025 17:16:31 GMT
- Title: HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models
- Authors: Sushant Gautam, Michael A. Riegler, Pål Halvorsen
- Abstract summary: Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics.
- Score: 4.099133096025821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas NLI-based clustering remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ~ 10-15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses. By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE .
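The abstract describes a pipeline of sampling, clustering, and metric computation. The sketch below illustrates only the general idea of embedding-based clustering followed by an entropy score over the clusters. It is not the HEDGE implementation and does not use the hedge-bench API: the function names, the greedy clustering rule, the 0.9 cosine threshold, and the toy 2-D embeddings are all assumptions made for illustration. The VASE metric is not defined in the abstract, so it is not reproduced here; plain Shannon entropy over cluster frequencies stands in for it.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_cluster(embeddings: Sequence[Sequence[float]],
                   threshold: float = 0.9) -> List[int]:
    """Hypothetical clustering rule: each answer joins the first cluster whose
    representative (its first member) has cosine similarity >= threshold,
    otherwise it starts a new cluster."""
    representatives: List[Sequence[float]] = []
    labels: List[int] = []
    for emb in embeddings:
        for idx, rep in enumerate(representatives):
            if cosine(emb, rep) >= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(emb)
            labels.append(len(representatives) - 1)
    return labels


def cluster_entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in nats) of the empirical cluster distribution."""
    n = len(labels)
    counts: Dict[int, int] = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return sum((c / n) * math.log(n / c) for c in counts.values())


if __name__ == "__main__":
    # Toy embeddings standing in for sampled answers to one question:
    # the first two are near-duplicates, the third points elsewhere.
    toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
    labels = greedy_cluster(toy_embeddings, threshold=0.9)
    print(labels, cluster_entropy(labels))
```

Under these assumptions, answers that all fall into one cluster give an entropy of zero, and the score grows as the sampled answers spread across more semantic clusters, which is the intuition behind treating high entropy as a hallucination signal.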
Related papers
- Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification [69.87877580725768]
Multimodal Visual Surrogate Compression (MVSC) learns to compress and adapt large 3D sMRI volumes into compact 2D features. MVSC has two key components: a Volume Context that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner.
arXiv Detail & Related papers (2026-01-29T13:05:46Z)
- VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations [4.509454543418357]
Hallucinations in video-capable vision models (VideoVLMs) remain frequent and high-confidence. We introduce VideoHEDGE, a modular framework for hallucination detection in question answering.
arXiv Detail & Related papers (2026-01-13T13:42:05Z)
- Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression [55.51959317490934]
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock potential gains in LLMs reasoning performance. We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily.
arXiv Detail & Related papers (2026-01-13T03:35:18Z)
- Manifold-based Sampling for In-Context Hallucination Detection in Large Language Models [5.187020963919455]
Large language models (LLMs) frequently generate factually incorrect or unsupported content, commonly referred to as hallucinations. We propose MB-ICL, a manifold-based demonstration sampling framework for selecting in-context demonstrations.
arXiv Detail & Related papers (2026-01-08T06:17:18Z)
- Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck [0.0]
We develop a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound.
arXiv Detail & Related papers (2025-08-26T20:00:51Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach [0.0]
Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple generations. Our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis.
arXiv Detail & Related papers (2025-07-05T19:20:59Z)
- Unsupervised Deep Clustering of MNIST with Triplet-Enhanced Convolutional Autoencoders [0.0]
This research implements an advanced unsupervised clustering system for MNIST handwritten digits. A deep neural autoencoder requires a training process during phase one to develop minimal yet interpretive representations of images.
arXiv Detail & Related papers (2025-06-11T18:26:13Z)
- Hallucination Detection in LLMs with Topological Divergence on Attention Graphs [60.83579255387347]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models. We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z)
- Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction [17.989559761931435]
We propose a novel "Fine-grained Visual-Semantic Interaction" framework for WSI classification.
It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics.
Our method demonstrates robust generalizability and strong transferability, dominantly outperforming the counterparts on the TCGA Lung Cancer dataset.
arXiv Detail & Related papers (2024-02-29T16:29:53Z)
- Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation [48.267001230607306]
We propose a pretraining framework that enhances multiscale consistency in EM volumes.
Our approach leverages a Siamese network architecture, integrating strong and weak data augmentations.
It effectively captures voxel and feature consistency, showing promise for learning transferable representations for EM analysis.
arXiv Detail & Related papers (2023-08-19T05:49:13Z)
- GSMFlow: Generation Shifts Mitigating Flow for Generalized Zero-Shot Learning [55.79997930181418]
Generalized Zero-Shot Learning aims to recognize images from both the seen and unseen classes by transferring semantic knowledge from seen to unseen classes.
It is a promising solution to take the advantage of generative models to hallucinate realistic unseen samples based on the knowledge learned from the seen classes.
We propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation.
arXiv Detail & Related papers (2022-07-05T04:04:37Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.