PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
- URL: http://arxiv.org/abs/2510.27680v1
- Date: Fri, 31 Oct 2025 17:49:01 GMT
- Title: PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
- Authors: Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel D. Church, Matthew E. Larson, Scott B. Perlman, Tomas A. Romero, Joshua D. Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
- Abstract summary: We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
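The abstract describes PETAR-4B as fusing PET, CT, and lesion contours into spatially grounded visual tokens for a language model. A minimal sketch of one way such mask-aware fusion could work — stacking the three aligned volumes as channels of a shared 3D patch embedding — is shown below; the module name, patch size, and dimensions are illustrative assumptions, not PETAR's released implementation.

```python
# Hypothetical sketch of mask-aware PET/CT token fusion (not PETAR's actual code).
# PET, CT, and a binary lesion mask are stacked as channels of one 3D volume,
# then cut into patch tokens that a language-model decoder can attend to.
import torch
import torch.nn as nn

class MaskAwarePatchEmbed3D(nn.Module):
    def __init__(self, embed_dim=768, patch=16):
        super().__init__()
        # 3 input channels: PET, CT, lesion mask (resampled to a common grid)
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, pet, ct, mask):
        # each input: (B, D, H, W) -> stacked to (B, 3, D, H, W)
        x = torch.stack([pet, ct, mask.float()], dim=1)
        x = self.proj(x)                      # (B, C, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, N_tokens, C)

embed = MaskAwarePatchEmbed3D()
pet = torch.randn(1, 128, 128, 128)   # toy volumes
ct = torch.randn(1, 128, 128, 128)
mask = torch.zeros(1, 128, 128, 128)
tokens = embed(pet, ct, mask)         # -> (1, 512, 768)
print(tokens.shape)
```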
Related papers
- Lesion Segmentation in FDG-PET/CT Using Swin Transformer U-Net 3D: A Robust Deep Learning Framework [0.0]
This paper presents a Swin Transformer UNet 3D (SwinUNet3D) framework for lesion segmentation in PET/CT scans. By combining shifted window self-attention with U-Net style skip connections, the model captures both global context and fine anatomical detail. Results show that SwinUNet3D achieves a Dice score of 0.88 and an IoU of 0.78, surpassing 3D U-Net (Dice 0.48, IoU 0.32) while also delivering faster inference times.
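For reference, the Dice and IoU figures quoted above follow the standard binary-mask definitions; a minimal sketch of how they are computed:

```python
# Standard Dice and IoU for binary masks, as reported in the entry above.
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    iou = inter / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou
```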
arXiv Detail & Related papers (2026-01-06T09:52:00Z)
- Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench [48.60251555171943]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities. In this work, we quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. We introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic…
arXiv Detail & Related papers (2026-01-06T05:58:50Z)
- Vision-Language Models for Automated 3D PET/CT Report Generation [13.781844347232079]
Automated PET/CT report generation is increasingly important for reducing clinical workload. PETRG-3D is an end-to-end 3D dual-branch framework that encodes PET and CT volumes and incorporates style-adaptive prompts. PETRG-Score is a lymphoma-specific evaluation protocol that measures metabolic and structural findings across curated anatomical regions.
arXiv Detail & Related papers (2025-11-25T10:07:57Z)
- PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography [24.091435019102587]
Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advances in vision-language models (VLMs) have shown strong potential in medical applications. We introduce PET2Rep, a benchmark for evaluating general and medical VLMs on radiology report generation for PET images.
arXiv Detail & Related papers (2025-08-06T03:46:51Z)
- SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images [21.883098685700666]
We develop SegAnyPET, a modality-specific 3D foundation model for universal promptable segmentation from PET images. Experimental results demonstrate that SegAnyPET can segment seen and unseen target organs using only one or a few prompt points.
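Promptable segmentation of this kind is typically driven by encoding user clicks as learned embeddings (as in SAM-style models). The sketch below illustrates that general idea; the class name, dimensions, and interface are assumptions, not SegAnyPET's actual API.

```python
# Generic point-prompt encoding for promptable 3D segmentation (SAM-style);
# a sketch of the idea only, not SegAnyPET's actual interface.
import torch.nn as nn

class PointPromptEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.coord_proj = nn.Linear(3, dim)    # (z, y, x) coords -> embedding
        self.label_emb = nn.Embedding(2, dim)  # 0 = background, 1 = foreground click

    def forward(self, points, labels):
        # points: (B, P, 3) normalized coords in [0, 1]; labels: (B, P) LongTensor
        return self.coord_proj(points) + self.label_emb(labels)
```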
arXiv Detail & Related papers (2025-02-20T08:17:13Z)
- 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
- RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language.
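The training objective described here is multimodal contrastive learning over organ-level image-text pairs. A minimal CLIP-style (symmetric InfoNCE) loss over a batch of such pairs might look like the sketch below; CT-GLIP's exact loss and encoders may differ.

```python
# Minimal CLIP-style contrastive loss over a batch of organ-level
# (image embedding, text embedding) pairs; a sketch, not CT-GLIP's exact objective.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)          # (B, D)
    txt = F.normalize(txt_emb, dim=-1)          # (B, D)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # symmetric loss: match each image to its text and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```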
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
- Score-Based Generative Models for PET Image Reconstruction [38.72868748574543]
We propose several PET-specific adaptations of score-based generative models.
The proposed framework is developed for both 2D and 3D PET.
In addition, we provide an extension to guided reconstruction using magnetic resonance images.
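Score-based generative models sample by following a learned approximation of the score, grad_x log p(x), under injected noise. The generic (unadjusted) Langevin update below illustrates the framework this entry builds on; it is a textbook sketch, not the paper's PET-specific adaptations.

```python
# Generic unadjusted Langevin sampling with a learned score network;
# illustrates the score-based framework, not this paper's PET adaptations.
import torch

def langevin_sample(score_net, x, step_size=1e-4, n_steps=100):
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        with torch.no_grad():
            grad = score_net(x)               # approximates grad_x log p(x)
        x = x + step_size * grad + (2 * step_size) ** 0.5 * noise
    return x
```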
arXiv Detail & Related papers (2023-08-27T19:43:43Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- Revisiting 3D Context Modeling with Supervised Pre-training for Universal Lesion Detection in CT Slices [48.85784310158493]
We propose a Modified Pseudo-3D Feature Pyramid Network (MP3D FPN) to efficiently extract 3D-context-enhanced 2D features for universal lesion detection in CT slices.
With the novel pre-training method, the proposed MP3D FPN achieves state-of-the-art detection performance on the DeepLesion dataset.
The proposed 3D pre-trained weights can potentially be used to boost the performance of other 3D medical image analysis tasks.
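Pseudo-3D designs of this kind typically factorize a full 3D convolution into a cheaper 2D in-plane convolution followed by a 1D convolution across slices. The block below is a generic sketch of that factorization, not the MP3D FPN reference implementation.

```python
# A pseudo-3D convolution block: a 2D spatial conv over each slice followed
# by a 1D conv across slices. A generic sketch, not the MP3D FPN code.
import torch.nn as nn

class Pseudo3DBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.spatial = nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.axial = nn.Conv3d(c_out, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (B, C, D, H, W)
        return self.act(self.axial(self.act(self.spatial(x))))
```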
arXiv Detail & Related papers (2020-12-16T07:11:16Z)