PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography
- URL: http://arxiv.org/abs/2508.04062v1
- Date: Wed, 06 Aug 2025 03:46:51 GMT
- Title: PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography
- Authors: Yichi Zhang, Wenbo Zhang, Zehui Ling, Gang Feng, Sisi Peng, Deshu Chen, Yuchen Liu, Hongwei Zhang, Shuqi Wang, Lanlan Li, Limei Han, Yuan Cheng, Zixin Hu, Yuan Qi, Le Xue
- Abstract summary: Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advances in vision-language models (VLMs) have shown strong potential in medical applications. We introduce PET2Rep, a benchmark for evaluating general and medical VLMs on radiology report generation for PET images.
- Score: 24.091435019102587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Positron emission tomography (PET) is a cornerstone of modern oncologic and neurologic imaging, distinguished by its unique ability to illuminate dynamic metabolic processes that transcend the anatomical focus of traditional imaging technologies. Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advances in vision-language models (VLMs) have shown strong potential in medical applications, presenting a promising avenue for automating report generation. However, existing applications of VLMs in the medical domain have predominantly focused on structural imaging modalities, while the unique characteristics of molecular PET imaging have largely been overlooked. To bridge this gap, we introduce PET2Rep, a large-scale comprehensive benchmark for evaluating general and medical VLMs on radiology report generation for PET images. PET2Rep stands out as the first dedicated dataset for PET report generation with metabolic information, uniquely capturing whole-body image-report pairs that cover dozens of organs, filling a critical gap in existing benchmarks and mirroring real-world clinical comprehensiveness. In addition to widely recognized natural language generation metrics, we introduce a series of clinical efficiency metrics to evaluate the quality of radiotracer uptake pattern descriptions for key organs in generated reports. We conduct a head-to-head comparison of 30 cutting-edge general-purpose and medical-specialized VLMs. The results show that current state-of-the-art VLMs perform poorly on the PET report generation task, falling considerably short of fulfilling practical needs. Moreover, we identify several key insufficiencies that need to be addressed to advance development in medical applications.
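Note: the abstract does not specify how the proposed clinical efficiency metrics are computed. As a rough, non-authoritative sketch of what an organ-level uptake-agreement score could look like, the Python snippet below compares per-organ uptake statements between a generated and a reference report via keyword matching; the organ list, uptake categories, and matching rules are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's actual metric): an organ-level
# uptake-agreement score for PET report generation, computed by naive
# keyword matching of uptake statements per organ.
import re

ORGANS = ["liver", "spleen", "brain", "lung", "kidney"]       # assumed organ subset
UPTAKE_LABELS = ["increased", "decreased", "normal"]          # assumed uptake categories

def extract_uptake(report: str) -> dict:
    """Naively map each organ to the uptake label mentioned in the same sentence."""
    findings = {}
    for sentence in re.split(r"[.;]\s*", report.lower()):
        for organ in ORGANS:
            if organ in sentence:
                for label in UPTAKE_LABELS:
                    if label in sentence:
                        findings[organ] = label
    return findings

def organ_level_accuracy(generated: str, reference: str) -> float:
    """Fraction of reference-report organs whose uptake label is reproduced."""
    gen, ref = extract_uptake(generated), extract_uptake(reference)
    if not ref:
        return 0.0
    correct = sum(1 for organ, label in ref.items() if gen.get(organ) == label)
    return correct / len(ref)

if __name__ == "__main__":
    reference = "Increased FDG uptake in the liver. Normal uptake in the spleen."
    generated = "The liver shows increased uptake; spleen uptake appears decreased."
    print(f"organ-level accuracy: {organ_level_accuracy(generated, reference):.2f}")  # -> 0.50
```

A practical metric of this kind would presumably rely on clinically curated organ vocabularies and more robust extraction than simple keyword matching.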
Related papers
- Personalized MR-Informed Diffusion Models for 3D PET Image Reconstruction [44.89560992517543]
We propose a simple method for generating subject-specific PET images from a dataset of PET-MR scans. The images we synthesize retain information from the subject's MR scan, leading to higher resolution and the retention of anatomical features. With simulated and real [18F]FDG datasets, we show that pre-training a personalized diffusion model with subject-specific "pseudo-PET" images improves reconstruction accuracy with low-count data.
arXiv Detail & Related papers (2025-06-04T10:24:14Z) - Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging [39.59895695500171]
We introduce the Cross-Fraternal Twin Masked Autoencoder (FratMAE), a novel framework that effectively integrates whole-body anatomical and functional information. FratMAE captures intricate cross-modal relationships and global uptake patterns, achieving superior performance on downstream tasks.
arXiv Detail & Related papers (2025-03-04T17:49:07Z) - SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images [21.883098685700666]
We develop SegAnyPET, a modality-specific 3D foundation model for universal promptable segmentation from PET images. Experimental results demonstrate that SegAnyPET can segment seen and unseen target organs using only one or a few prompt points.
arXiv Detail & Related papers (2025-02-20T08:17:13Z) - Autopet III challenge: Incorporating anatomical knowledge into nnUNet for lesion segmentation in PET/CT [4.376648893167674]
The autoPET III Challenge focuses on advancing automated segmentation of tumor lesions in PET/CT images.
We developed a classifier that identifies the tracer of the given PET/CT based on the Maximum Intensity Projection of the PET scan.
Our final submission achieves cross-validation Dice scores of 76.90% and 61.33% for the publicly available FDG and PSMA datasets.
arXiv Detail & Related papers (2024-09-18T17:16:57Z) - Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study evaluated the performance of Gemini, GPT-4, and four popular large models in an exhaustive evaluation across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z) - LLM-driven Multimodal Target Volume Contouring in Radiation Oncology [46.23891509553877]
Large language models (LLMs) can facilitate the integration of textual information and images.
We present a novel LLM-driven multimodal AI, namely LLMSeg, that is applicable to the challenging task of target volume contouring for radiation therapy.
We demonstrate that the proposed model exhibits markedly improved performance compared to conventional unimodal AI models.
arXiv Detail & Related papers (2023-11-03T13:38:42Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Score-Based Generative Models for PET Image Reconstruction [38.72868748574543]
We propose several PET-specific adaptations of score-based generative models.
The proposed framework is developed for both 2D and 3D PET.
In addition, we provide an extension to guided reconstruction using magnetic resonance images.
arXiv Detail & Related papers (2023-08-27T19:43:43Z) - Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction [62.29541106695824]
This paper presents a coarse-to-fine PET reconstruction framework that consists of a coarse prediction module (CPM) and an iterative refinement module (IRM).
By delegating most of the computational overhead to the CPM, the overall sampling speed of our method can be significantly improved.
Two additional strategies, i.e., an auxiliary guidance strategy and a contrastive diffusion strategy, are proposed and integrated into the reconstruction process.
arXiv Detail & Related papers (2023-08-20T04:10:36Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [72.8965643836841]
We introduce XrayGPT, a novel conversational medical vision-language model. It can analyze and answer open-ended questions about chest radiographs. We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG).
CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure.
Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z)