A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling
- URL: http://arxiv.org/abs/2601.12820v1
- Date: Mon, 19 Jan 2026 08:30:09 GMT
- Title: A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling
- Authors: Wei Chen, Liang Wu, Shuyi Lu, Yuanyuan Sun, Wenkai Bi, Zilong Yuan, Yaoyao He, Feng Wang, Junchi Ma, Shuyong Liu, Zhaoping Cheng, Xiaoyan Hu, Jianfeng Qiu
- Abstract summary: SDF-HOLO is a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. It decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module. The model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions.
- Score: 17.709593152803695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.
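The abstract outlines a dual-stream design in which CT and PET are encoded separately and then coupled through a cross-modal interaction module, so that anatomical context refines PET aggregation and metabolic saliency guides morphological reasoning. A minimal sketch of that interaction pattern is given below, assuming a PyTorch implementation; the patch embedding, dimensions, and module names are illustrative assumptions and not the paper's released code.

```python
# Hedged sketch: a toy dual-stream PET/CT fusion module in PyTorch.
# All names, dimensions, and the 3D patch-embedding scheme are assumptions
# made for illustration; SDF-HOLO's actual implementation is not published here.
import torch
import torch.nn as nn


class DualStreamFusion(nn.Module):
    """Toy analogue of decoupled CT/PET encoders plus cross-modal interaction."""

    def __init__(self, embed_dim: int = 128, num_heads: int = 4, patch: int = 16):
        super().__init__()
        # Separate (decoupled) 3D patch embeddings for CT and PET volumes.
        self.ct_embed = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)
        self.pet_embed = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)
        # Cross-modal interaction: each stream attends to the other.
        self.ct_to_pet = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.pet_to_ct = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm_ct = nn.LayerNorm(embed_dim)
        self.norm_pet = nn.LayerNorm(embed_dim)

    def forward(self, ct: torch.Tensor, pet: torch.Tensor):
        # ct, pet: (batch, 1, D, H, W) volumes resampled to a common grid.
        ct_tok = self.ct_embed(ct).flatten(2).transpose(1, 2)     # (B, N, C)
        pet_tok = self.pet_embed(pet).flatten(2).transpose(1, 2)  # (B, N, C)
        # Anatomical context (CT) refines PET token aggregation ...
        pet_ref, _ = self.ct_to_pet(query=pet_tok, key=ct_tok, value=ct_tok)
        # ... while metabolic saliency (PET) guides morphological reasoning on CT.
        ct_ref, _ = self.pet_to_ct(query=ct_tok, key=pet_tok, value=pet_tok)
        return self.norm_ct(ct_tok + ct_ref), self.norm_pet(pet_tok + pet_ref)


if __name__ == "__main__":
    model = DualStreamFusion()
    ct = torch.randn(1, 1, 64, 64, 64)   # toy CT crop, not a 2 m total-body scan
    pet = torch.randn(1, 1, 64, 64, 64)  # co-registered toy PET crop
    ct_feat, pet_feat = model(ct, pet)
    print(ct_feat.shape, pet_feat.shape)  # torch.Size([1, 64, 128]) for each stream
```

This sketch covers only the cross-modal coupling; the paper's hierarchical context modeling (local windows with global attention) and voxel-mask-text alignment against segmentation anchors would sit on top of such fused token streams.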
Related papers
- Using Unsupervised Domain Adaptation Semantic Segmentation for Pulmonary Embolism Detection in Computed Tomography Pulmonary Angiogram (CTPA) Images [0.0]
An unsupervised domain adaptation (UDA) framework is proposed, utilizing a Transformer backbone and a Mean-Teacher architecture for cross-center semantic segmentation. The primary focus is placed on enhancing pseudo-label reliability by learning deep structural information within the feature space. Experimental validation conducted on cross-center datasets (FUMPE and CAD-PE) demonstrates significant performance gains.
arXiv Detail & Related papers (2026-02-23T14:33:24Z) - OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis [53.01523944168442]
Clinical interpretation relies on both slice-driven local features and volume-driven spatial representations. Existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios.
arXiv Detail & Related papers (2026-02-18T00:42:41Z) - A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z) - Vision-Language Models for Automated 3D PET/CT Report Generation [13.781844347232079]
Automated PET/CT report generation is increasingly important for reducing clinical workload. PETRG-3D is an end-to-end 3D dual-branch framework that encodes PET and CT volumes and incorporates style-adaptive prompts. PETRG-Score is a lymphoma-specific evaluation protocol that measures metabolic and structural findings across curated anatomical regions.
arXiv Detail & Related papers (2025-11-25T10:07:57Z) - FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising [55.04342933312839]
We propose FoundDiff, a foundational diffusion model for unified and generalizable low-dose computed tomography (CT) denoising. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising.
arXiv Detail & Related papers (2025-08-24T11:03:56Z) - RadFabric: Agentic AI System with Reasoning Capability for Radiology [61.25593938175618]
RadFabric is a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence-based diagnoses.
arXiv Detail & Related papers (2025-06-17T03:10:33Z) - CodeBrain: Towards Decoupled Interpretability and Multi-Scale Architecture for EEG Foundation Model [52.466542039411515]
EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention.
arXiv Detail & Related papers (2025-06-10T17:20:39Z) - Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging [39.59895695500171]
We introduce the Cross-Fraternal Twin Masked Autoencoder (FratMAE), a novel framework that effectively integrates whole-body anatomical and functional information. FratMAE captures intricate cross-modal relationships and global uptake patterns, achieving superior performance on downstream tasks.
arXiv Detail & Related papers (2025-03-04T17:49:07Z) - H2ASeg: Hierarchical Adaptive Interaction and Weighting Network for Tumor Segmentation in PET/CT Images [6.753315684414596]
Positron emission tomography (PET) combined with computed tomography (CT) imaging is routinely used in cancer diagnosis and prognosis.
Traditional multi-modal segmentation solutions rely on concatenation operations for modality fusion.
We propose a Hierarchical Adaptive Interaction and Weighting Network termed H2ASeg to explore intrinsic cross-modal correlations.
arXiv Detail & Related papers (2024-03-27T08:28:14Z) - G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z) - Multimodal Spatial Attention Module for Targeting Multimodal PET-CT Lung Tumor Segmentation [11.622615048002567]
Multimodal spatial attention module (MSAM) learns to emphasize regions related to tumors.
MSAM can be applied to common backbone architectures and trained end-to-end.
arXiv Detail & Related papers (2020-07-29T10:27:22Z)