Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench
- URL: http://arxiv.org/abs/2601.02737v2
- Date: Thu, 15 Jan 2026 06:08:23 GMT
- Title: Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench
- Authors: Zanting Ye, Xiaolong Niu, Xuanbin Wu, Xu Han, Shengyuan Liu, Jing Hao, Zhihao Peng, Hao Sun, Jieqin Lv, Fanghu Wang, Yanchao Huang, Hubing Wu, Yixuan Yuan, Habib Zaidi, Arman Rahmim, Yefeng Zheng, Lijun Lu
- Abstract summary: Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities. In this work, we quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. We introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting, widely considered to enhance reasoning, paradoxically decouples linguistic generation from visual evidence in PET, producing clinically fluent but factually ungrounded diagnoses. To resolve this, we propose Atomic Visual Alignment (AVA), a simple fine-tuning strategy that enforces the mastery of low-level functional perception prior to high-level diagnostic reasoning. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%. Code and data are available at https://github.com/yezanting/PET-Bench.
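The abstract describes AVA only at a high level: enforce mastery of low-level functional perception before fine-tuning on diagnostic reasoning. As a rough sketch of what such a perception-first curriculum could look like (the tier names and the `finetune` helper are hypothetical illustrations, not PET-Bench's actual API):

```python
# Minimal sketch of a perception-first curriculum in the spirit of AVA.
# All names here are illustrative assumptions, not the authors' code.

def train_with_ava(model, qa_pairs, finetune):
    """Two-stage fine-tuning: atomic perception first, then diagnosis."""
    # Stage 1: atomic visual alignment -- low-level questions about raw
    # tracer biodistribution (e.g., "Which organ shows the highest uptake?").
    perception = [qa for qa in qa_pairs if qa["tier"] == "perception"]
    model = finetune(model, perception)

    # Stage 2: high-level diagnostic QA, attempted only once the model
    # can ground its answers in the functional signal itself.
    diagnosis = [qa for qa in qa_pairs if qa["tier"] == "diagnosis"]
    return finetune(model, diagnosis)
```

The ordering is the substantive point: the paper attributes the CoT hallucination trap to reasoning that outruns perception, so diagnostic supervision is withheld until the perceptual skills are in place.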
Related papers
- ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval [64.14282916266998]
Composed Image Retrieval aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. We propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline. Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-02T04:52:54Z) - MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
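The summary does not spell out the mDPO objective. For orientation only, the standard DPO loss that a multimodal variant would presumably extend is

$$ \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right], $$

where $y_w$ and $y_l$ are the preferred and rejected responses; in the multimodal setting the prompt $x$ includes the image, and mDPO-style objectives additionally condition the preference on the visual context. How MedAlign instantiates this is not stated in the summary above.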
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - Towards Interactive Lesion Segmentation in Whole-Body PET/CT with Promptable Models [9.254535645473311]
We present our submission to the autoPET/CT IV challenge. We extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance.
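Encoding clicks as extra input channels is a common promptable-segmentation pattern; a minimal sketch under assumed array shapes (not the authors' code):

```python
import numpy as np

def add_click_channels(volume, fg_clicks, bg_clicks):
    """Append foreground/background click maps as extra input channels.

    volume: (C, D, H, W) PET/CT input; clicks: lists of (z, y, x) voxels.
    """
    fg = np.zeros(volume.shape[1:], dtype=volume.dtype)
    bg = np.zeros_like(fg)
    for z, y, x in fg_clicks:
        fg[z, y, x] = 1.0  # systems often smooth clicks into Gaussian heatmaps
    for z, y, x in bg_clicks:
        bg[z, y, x] = 1.0
    return np.concatenate([volume, fg[None], bg[None]], axis=0)
```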
arXiv Detail & Related papers (2025-08-29T14:49:58Z) - PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography [24.091435019102587]
Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advances in vision-language models (VLMs) have shown strong potential for medical applications. We introduce PET2Rep, a benchmark for evaluating general-purpose and medical VLMs on radiology report generation for PET images.
arXiv Detail & Related papers (2025-08-06T03:46:51Z) - ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models [82.04858317800097]
We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human reasoning. ForenX employs powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues. We introduce ForgReason, a dataset dedicated to descriptions of forgery evidence in AI-generated images.
arXiv Detail & Related papers (2025-08-02T15:21:26Z) - Supervised Diffusion-Model-Based PET Image Reconstruction [44.89560992517543]
Diffusion models (DMs) have been introduced as regularizing priors for PET image reconstruction. We propose a supervised DM-based algorithm for PET reconstruction. Our method enforces the non-negativity of PET's Poisson likelihood model and accommodates the wide intensity range of PET images.
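For context, the Poisson likelihood referred to here is the standard PET forward model (the paper's exact formulation may add terms):

$$ y_i \sim \mathrm{Poisson}\!\big([\mathbf{A}\mathbf{x} + \mathbf{r}]_i\big), \qquad -\log p(\mathbf{y} \mid \mathbf{x}) = \sum_i \Big( [\mathbf{A}\mathbf{x} + \mathbf{r}]_i - y_i \log [\mathbf{A}\mathbf{x} + \mathbf{r}]_i \Big) + \mathrm{const}, \qquad \mathbf{x} \ge 0, $$

where $\mathbf{x}$ is the non-negative activity image, $\mathbf{A}$ the system matrix, $\mathbf{r}$ the expected scatter and random counts, and $y_i$ the measured counts; enforcing $\mathbf{x} \ge 0$ is what the summary means by non-negativity.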
arXiv Detail & Related papers (2025-06-30T16:39:50Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - PAD-F: Prior-Aware Debiasing Framework for Long-Tailed X-ray Prohibited Item Detection [56.25222232778367]
Object classes in real-world prohibited item detection scenarios often follow a distinctly long-tailed distribution. We introduce the Prior-Aware Debiasing Framework (PAD-F), a novel approach that employs a two-pronged strategy. PAD-F significantly boosts the performance of multiple popular detectors.
arXiv Detail & Related papers (2024-11-27T06:13:56Z) - From FDG to PSMA: A Hitchhiker's Guide to Multitracer, Multicenter Lesion Segmentation in PET/CT Imaging [0.9384264274298444]
We present our solution for the autoPET III challenge, targeting multitracer, multicenter generalization using the nnU-Net framework with the ResEncL architecture.
Key techniques include misalignment data augmentation and multi-modal pretraining across CT, MR, and PET datasets.
Compared to the default nnU-Net, which achieved a Dice score of 57.61, our model significantly improved performance with a Dice score of 68.40, alongside a reduction in false positive (FPvol: 7.82) and false negative (FNvol: 10.35) volumes.
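The report lists misalignment data augmentation among its key techniques without detail; one plausible minimal interpretation is a random rigid shift of one modality relative to the other (a sketch with hypothetical parameters, not the challenge code):

```python
import numpy as np

def misalign(ct, pet, max_shift=3, rng=None):
    """Randomly translate PET relative to CT to simulate registration error.

    ct, pet: (D, H, W) volumes; max_shift: maximum voxel shift per axis.
    """
    rng = rng or np.random.default_rng()
    shifts = tuple(rng.integers(-max_shift, max_shift + 1, size=3))
    # np.roll wraps voxels around the border; a production version would
    # pad with background instead of wrapping.
    return ct, np.roll(pet, shifts, axis=(0, 1, 2))
```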
arXiv Detail & Related papers (2024-09-14T16:39:17Z) - Multi-modal Evidential Fusion Network for Trustworthy PET/CT Tumor Segmentation [5.839660501978193]
In clinical settings, the quality of PET and CT images often varies significantly, leading to uncertainty in the modality information extracted by networks. We propose a novel Multi-modal Evidential Fusion Network (MEFN), which consists of two core stages: Cross-Modal Feature Learning (CFL) and Multi-modal Trustworthy Fusion (MTF). Our model provides radiologists with a credible uncertainty estimate for each segmentation result, supporting the decision to accept or reject the automatic segmentation.
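The summary does not give MEFN's uncertainty formulation; evidential segmentation networks typically follow the standard subjective-logic recipe, which for $K$ classes reads

$$ \alpha_k = e_k + 1, \qquad S = \sum_{k=1}^{K} \alpha_k, \qquad b_k = \frac{e_k}{S}, \qquad u = \frac{K}{S}, $$

where $e_k \ge 0$ is the evidence the network outputs for class $k$, $b_k$ the belief mass, and $u$ the per-voxel uncertainty, with $u + \sum_k b_k = 1$. Whether MEFN uses exactly this parameterization is an assumption here.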
arXiv Detail & Related papers (2024-06-26T13:14:24Z) - Bilinear pooling and metric learning network for early Alzheimer's disease identification with FDG-PET images [0.293168019422713]
FDG-PET reveals altered brain metabolism in individuals with mild cognitive impairment (MCI) and Alzheimer's disease (AD).
In this paper, we propose a novel bilinear pooling and metric learning network (BMNet) that extracts inter-region representation features and distinguishes hard samples in the embedding space.
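As a reference point for the pooling step (BMNet's inter-region features and metric learning are richer than this), classic bilinear pooling of two feature sets looks like the following sketch:

```python
import numpy as np

def bilinear_pool(f1, f2):
    """Classic bilinear pooling of two feature sets.

    f1: (N, d1), f2: (N, d2) features over N locations/regions;
    returns a (d1*d2,) descriptor of pairwise feature interactions.
    """
    b = (f1.T @ f2 / f1.shape[0]).ravel()   # averaged outer products
    b = np.sign(b) * np.sqrt(np.abs(b))     # signed square-root normalization
    return b / (np.linalg.norm(b) + 1e-12)  # L2 normalization
```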
arXiv Detail & Related papers (2021-11-09T08:17:55Z)