GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans
- URL: http://arxiv.org/abs/2603.00162v1
- Date: Thu, 26 Feb 2026 04:39:15 GMT
- Title: GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans
- Authors: Joy T Wu, Daniel Beckmann, Sarah Miller, Alexander Lee, Elizabeth Theng, Stephan Altmayer, Ken Chang, David Kersting, Tomoaki Otani, Brittany Z Dashevsky, Hye Lim Park, Matteo Novello, Kip Guja, Curtis Langlotz, Ismini Lourentzou, Daniel Gruhl, Benjamin Risse, Guido A Davidzon
- Abstract summary: We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement. From 3,948 minutes of raw 60Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for machine learning applications.
- Score: 32.135469601806754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: [18F]FDG-PET/CT is a cornerstone imaging modality for tumor staging and treatment response assessment across many cancer types, yet expert reader shortages necessitate more efficient diagnostic aids. While standalone AI models for automatic lesion segmentation exist, clinical translation remains hindered by concerns about interpretability, explainability, reliability, and workflow integration. We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement on 346 FDG-PET/CT scans. Each study was read by a trainee and a board-certified nuclear medicine or radiology specialist using an eye-tracking-enabled annotation platform that simulates routine clinical reads. From 3,948 minutes of raw 60Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for multiple machine learning applications. Baseline validation experiments demonstrate that a 3D nnUNet tumor segmentation model achieved superior performance when incorporating expert gaze patterns versus without (Dice score 0.6819 versus 0.6008), and that vision transformers trained on sequential gaze and PET/CT images can improve dynamic lesion localization (74.95% of predicted gaze points closer to the tumor) and expert intention prediction (accuracy 67.53%, AUROC 0.747). GazeXPErT is a resource designed to support multiple machine learning problems beyond these baseline experiments, including but not limited to visual grounding and causal reasoning, clinically explainable feature augmentation, human-computer interaction, human intention prediction and understanding, and expert gaze-rewarded modeling approaches to AI in oncologic FDG-PET/CT imaging.
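The abstract describes the gaze-conditioned segmentation baseline only at a high level. Below is a minimal sketch of one plausible way to feed expert gaze into a 3D segmentation network: rasterize fixations into an attention volume and stack it as an extra input channel. The fixation format (z, y, x, duration in voxel coordinates), the Gaussian-splat rendering, and the function names are assumptions for illustration, not the authors' released pipeline or data schema.

```python
# Sketch: rasterize expert gaze fixations into a 3D "attention" volume and
# stack it as an extra input channel alongside CT and PET.
# Assumptions (not from the paper): fixations come as (z, y, x, duration_s)
# tuples in voxel coordinates, and CT/PET are already resampled to a common grid.
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_to_heatmap(fixations, shape, sigma_vox=3.0):
    """Accumulate fixation durations into a volume, then blur and normalize."""
    heat = np.zeros(shape, dtype=np.float32)
    for z, y, x, dur in fixations:
        zi, yi, xi = int(round(z)), int(round(y)), int(round(x))
        if 0 <= zi < shape[0] and 0 <= yi < shape[1] and 0 <= xi < shape[2]:
            heat[zi, yi, xi] += dur            # longer dwell -> stronger signal
    heat = gaussian_filter(heat, sigma=sigma_vox)  # spread fixations spatially
    if heat.max() > 0:
        heat /= heat.max()                     # scale to [0, 1]
    return heat

def build_model_input(ct, pet_suv, fixations):
    """Stack CT, PET, and the gaze heatmap into a (3, D, H, W) array."""
    heat = gaze_to_heatmap(fixations, ct.shape)
    return np.stack([ct, pet_suv, heat], axis=0)
```

An nnUNet-style pipeline could then treat the heatmap as a third modality channel; whether the released dataset encodes gaze this way or only as per-slice COCO-style trajectories should be checked against the dataset documentation.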
Related papers
- Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench [48.60251555171943]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities. In this work, we quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. We introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic
arXiv Detail & Related papers (2026-01-06T05:58:50Z) - Vision-Language Models for Automated 3D PET/CT Report Generation [13.781844347232079]
Automated PET/CT report generation is increasingly important for reducing clinical workload. PETRG-3D is an end-to-end 3D dual-branch framework that encodes PET and CT volumes and incorporates style-adaptive prompts. PETRG-Score is a lymphoma-specific evaluation protocol that measures metabolic and structural findings across curated anatomical regions.
arXiv Detail & Related papers (2025-11-25T10:07:57Z) - 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z) - AutoPET III Challenge: Tumor Lesion Segmentation using ResEnc-Model Ensemble [1.3467243219009812]
We trained a 3D Residual encoder U-Net within the no new U-Net framework to generalize the performance of automatic lesion segmentation.
We leveraged test-time augmentations and other post-processing techniques to enhance tumor lesion segmentation.
Our team currently holds the top position in the Auto-PET III challenge and outperformed the challenge baseline model on the preliminary test set with a Dice score of 0.9627.
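The test-time augmentation mentioned in this entry usually means averaging predictions over simple invariant transforms such as mirror flips. A minimal sketch follows, assuming a PyTorch-style 3D segmentation model mapping (B, C, D, H, W) tensors to per-voxel logits; the flip set is illustrative, not this team's exact configuration.

```python
# Sketch: mirror-flip test-time augmentation for a 3D segmentation model.
# Assumption: `model` maps (B, C, D, H, W) -> (B, classes, D, H, W) logits.
import torch

@torch.no_grad()
def predict_with_tta(model, volume):
    flip_sets = [(), (2,), (3,), (4,), (2, 3), (2, 4), (3, 4), (2, 3, 4)]
    probs = None
    for dims in flip_sets:
        x = torch.flip(volume, dims) if dims else volume
        p = torch.softmax(model(x), dim=1)
        p = torch.flip(p, dims) if dims else p   # undo the flip on the output
        probs = p if probs is None else probs + p
    return probs / len(flip_sets)
```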
arXiv Detail & Related papers (2024-09-19T20:18:39Z) - Autopet III challenge: Incorporating anatomical knowledge into nnUNet for lesion segmentation in PET/CT [4.376648893167674]
The autoPET III Challenge focuses on advancing automated segmentation of tumor lesions in PET/CT images.
We developed a classifier that identifies the tracer of the given PET/CT based on the Maximum Intensity Projection of the PET scan.
Our final submission achieves cross-validation Dice scores of 76.90% and 61.33% for the publicly available FDG and PSMA datasets.
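The tracer classifier in this entry keys on a Maximum Intensity Projection (MIP) of the PET volume, i.e. the per-ray maximum along one axis. A minimal numpy sketch; the axis choice and array shapes are assumptions for illustration.

```python
# Sketch: maximum intensity projections of a PET SUV volume of shape (D, H, W).
import numpy as np

def mip(pet_suv, axis=1):
    """Collapse the volume along `axis` by taking the voxel-wise maximum."""
    return pet_suv.max(axis=axis)

pet = np.random.rand(200, 192, 192).astype(np.float32)  # placeholder volume
coronal_mip = mip(pet, axis=1)   # (200, 192) image, e.g. input to a 2D classifier
sagittal_mip = mip(pet, axis=2)
```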
arXiv Detail & Related papers (2024-09-18T17:16:57Z) - From FDG to PSMA: A Hitchhiker's Guide to Multitracer, Multicenter Lesion Segmentation in PET/CT Imaging [0.9384264274298444]
We present our solution for the autoPET III challenge, targeting multitracer, multicenter generalization using the nnU-Net framework with the ResEncL architecture.
Key techniques include misalignment data augmentation and multi-modal pretraining across CT, MR, and PET datasets.
Compared to the default nnU-Net, which achieved a Dice score of 57.61, our model significantly improved performance with a Dice score of 68.40, alongside a reduction in false positive (FPvol: 7.82) and false negative (FNvol: 10.35) volumes.
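The Dice scores quoted across these entries compare predicted and reference lesion masks. A minimal sketch of the plain voxel-wise formulation is below; note that the challenge's false-positive/false-negative volume metrics are computed per connected lesion component, so the voxel-wise helper here is a simplified stand-in, not the official metric.

```python
# Sketch: Dice coefficient and a simplified false-positive volume on binary masks.
import numpy as np

def dice(pred, ref, eps=1e-8):
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    return (2.0 * intersection + eps) / (pred.sum() + ref.sum() + eps)

def fp_volume_ml(pred, ref, voxel_volume_ml):
    """Volume of predicted voxels outside the reference (simplified, voxel-wise)."""
    return np.logical_and(pred.astype(bool), ~ref.astype(bool)).sum() * voxel_volume_ml
```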
arXiv Detail & Related papers (2024-09-14T16:39:17Z) - Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation [42.06416052431378]
2D radiology captioning falls short of reflecting the real-world diagnostic challenge posed by volumetric 3D anatomy.
We collected 3D-BrainCT, a dataset of 18,885 text-scan pairs, and applied clinical visual instruction tuning to train BrainGPT models to generate radiology-adherent 3D brain CT reports.
Our work embodies a holistic framework that showcased the first-hand experience of curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
arXiv Detail & Related papers (2024-07-02T12:58:35Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language.
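Organ-level image-text contrastive pretraining of the kind summarized here generally pairs an organ-crop embedding with its text embedding and applies a symmetric InfoNCE loss. A minimal sketch assuming precomputed, L2-normalized embeddings; the encoders themselves and the temperature value are omitted or illustrative.

```python
# Sketch: symmetric CLIP-style contrastive loss over matched organ/text embeddings.
# Assumption: rows of `img` and `txt` are L2-normalized and row i of each
# describes the same organ-level pair.
import torch
import torch.nn.functional as F

def clip_loss(img, txt, temperature=0.07):
    logits = img @ txt.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)   # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```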
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - Swin-Tempo: Temporal-Aware Lung Nodule Detection in CT Scans as Video Sequences Using Swin Transformer-Enhanced UNet [2.7547288571938795]
We present an innovative model that harnesses the strengths of both convolutional neural networks and vision transformers.
Inspired by object detection in videos, we treat each 3D CT image as a video, individual slices as frames, and lung nodules as objects, enabling a time-series application.
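Treating a CT volume as a video, as described in this entry, amounts to presenting axial slices as an ordered frame sequence to a temporal detector. A minimal sketch of the reshaping; the lung window/level values used for normalization are illustrative assumptions.

```python
# Sketch: turn a CT volume (D, H, W) into a "video" array of frames (D, 1, H, W).
import numpy as np

def ct_to_frames(ct_hu, window_center=-600.0, window_width=1500.0):
    lo, hi = window_center - window_width / 2, window_center + window_width / 2
    frames = np.clip(ct_hu, lo, hi)
    frames = (frames - lo) / (hi - lo)              # normalize to [0, 1]
    return frames[:, None, :, :].astype(np.float32) # each slice becomes one frame
```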
arXiv Detail & Related papers (2023-10-05T07:48:55Z) - EMT-NET: Efficient multitask network for computer-aided diagnosis of breast cancer [58.720142291102135]
We propose an efficient and lightweight learning architecture to classify and segment breast tumors simultaneously.
We incorporate a segmentation task into a tumor classification network, which makes the backbone network learn representations focused on tumor regions.
The accuracy, sensitivity, and specificity of tumor classification are 88.6%, 94.1%, and 85.3%, respectively.
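The multitask setup in this entry couples a classification head and a segmentation head on a shared backbone. A minimal sketch of a combined objective is below; the loss terms and the 0.5 weighting are assumptions for illustration, not EMT-NET's published formulation.

```python
# Sketch: joint loss for simultaneous tumor classification and segmentation.
# Assumption: `cls_logits` is (B, num_classes), `seg_logits` is (B, 1, H, W),
# and the weighting is illustrative rather than taken from EMT-NET.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_labels, seg_logits, seg_masks, seg_weight=0.5):
    cls_loss = F.cross_entropy(cls_logits, cls_labels)
    seg_loss = F.binary_cross_entropy_with_logits(seg_logits, seg_masks.float())
    return cls_loss + seg_weight * seg_loss
```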
arXiv Detail & Related papers (2022-01-13T05:24:40Z)