Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation
- URL: http://arxiv.org/abs/2508.13068v1
- Date: Mon, 18 Aug 2025 16:42:29 GMT
- Title: Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation
- Authors: Tanjim Islam Riju, Shuchismita Anwar, Saman Sarker Joy, Farig Sadeque, Swakkhar Shatabda
- Abstract summary: We propose a two-stage framework that enhances disease classification and region-aware radiology report generation from chest X-rays. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords.
- Score: 1.5087814338685968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals and is equipped with a novel multi-term gaze-attention loss combining MSE, KL divergence, correlation, and center-of-mass alignment. Incorporating fixations improves F1 score from 0.597 to 0.631 (+5.70%) and AUC from 0.821 to 0.849 (+3.41%), while also improving precision and recall, highlighting the effectiveness of gaze-informed attention supervision. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords, maps them to anatomical regions using a curated dictionary constructed from domain-specific priors, and generates region-aligned sentences via structured prompts. This pipeline improves report quality as measured by clinical keyword recall and ROUGE overlap. Our results demonstrate that integrating gaze data improves both classification performance and the interpretability of generated medical reports.
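To make the first stage concrete, here is a minimal PyTorch sketch of a multi-term gaze-attention loss of the kind the abstract describes, combining MSE, KL divergence, correlation, and center-of-mass alignment between a model attention map and a radiologist fixation heatmap. The term weights, normalization choices, and tensor shapes are assumptions for illustration, not the authors' released implementation.
```python
import torch
import torch.nn.functional as F

def gaze_attention_loss(attn, gaze, weights=(1.0, 1.0, 1.0, 1.0), eps=1e-8):
    """attn, gaze: (B, H, W) non-negative attention / fixation heatmaps."""
    B, H, W = attn.shape
    # Normalize both maps to probability distributions over pixels.
    a = attn.reshape(B, -1)
    g = gaze.reshape(B, -1)
    a = a / (a.sum(dim=1, keepdim=True) + eps)
    g = g / (g.sum(dim=1, keepdim=True) + eps)

    # 1) MSE between the normalized maps.
    l_mse = F.mse_loss(a, g)

    # 2) KL divergence KL(gaze || attention); input must be log-probabilities.
    l_kl = F.kl_div((a + eps).log(), g, reduction="batchmean")

    # 3) One minus Pearson correlation, so higher correlation lowers the loss.
    a_c = a - a.mean(dim=1, keepdim=True)
    g_c = g - g.mean(dim=1, keepdim=True)
    corr = (a_c * g_c).sum(dim=1) / (a_c.norm(dim=1) * g_c.norm(dim=1) + eps)
    l_corr = (1.0 - corr).mean()

    # 4) Center-of-mass alignment: distance between the two map centroids.
    ys = torch.arange(H, dtype=attn.dtype, device=attn.device)
    xs = torch.arange(W, dtype=attn.dtype, device=attn.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gy.flatten(), gx.flatten()], dim=-1)  # (H*W, 2)
    l_com = ((a @ coords) - (g @ coords)).norm(dim=1).mean()

    w = weights
    return w[0] * l_mse + w[1] * l_kl + w[2] * l_corr + w[3] * l_com
```
Normalizing both maps to distributions first keeps the four terms on comparable scales; in practice the weights would be tuned on validation F1/AUC.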
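The second stage can be sketched just as compactly: take the classifier's confidence-weighted diagnostic keywords, map them to anatomical regions through a curated dictionary, and fill structured prompts for region-aligned sentence generation. The dictionary entries, threshold, and prompt template below are invented placeholders standing in for the paper's domain-specific priors.
```python
# Placeholder mapping; the paper uses a curated dictionary built from
# domain-specific priors, which is not reproduced here.
KEYWORD_TO_REGION = {
    "cardiomegaly": "cardiac silhouette",
    "pleural effusion": "costophrenic angles",
    "atelectasis": "lung bases",
}

def region_aligned_prompts(keyword_confidences, threshold=0.5):
    """keyword_confidences: dict of diagnostic keyword -> classifier confidence."""
    prompts = []
    # Keep keywords above the confidence threshold, most confident first.
    ranked = sorted(keyword_confidences.items(), key=lambda kv: -kv[1])
    for keyword, conf in ranked:
        if conf < threshold:
            continue
        region = KEYWORD_TO_REGION.get(keyword, "unspecified region")
        # Structured prompt that ties the finding to its anatomical region.
        prompts.append(
            f"Describe the {region} with respect to {keyword} "
            f"(confidence {conf:.2f})."
        )
    return prompts

# Example: only the high-confidence finding survives the threshold.
print(region_aligned_prompts({"cardiomegaly": 0.91, "atelectasis": 0.34}))
```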
Related papers
- A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation [15.331803613974365]
We propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage 1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories. In Stage 2, we introduce a Disease-Visual Attention Fusion module to integrate disease-aware representations with visual features.
arXiv Detail & Related papers (2025-11-15T15:31:51Z)
- An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection [55.35661671061754]
Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. We propose a framework that enhances disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection.
arXiv Detail & Related papers (2025-10-21T17:18:55Z)
- Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z)
- Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning [11.537036709742345]
DiagCoT is a framework that applies supervised fine-tuning to general-purpose vision-language models (VLMs). DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to enhance factual accuracy and fluency. It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets.
arXiv Detail & Related papers (2025-09-08T08:01:26Z)
- PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation [12.860257420677122]
PriorRG is a chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction. In Stage 2, we integrate a prior-aware coarse-to-fine decoding for report generation that enhances prior knowledge with the vision encoder's hidden states.
arXiv Detail & Related papers (2025-08-07T13:02:20Z)
- Advancing Chronic Tuberculosis Diagnostics Using Vision-Language Models: A Multi-modal Framework for Precision Analysis [0.0]
This study proposes a Vision-Language Model (VLM) to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation. The model demonstrated high precision (94 percent) and recall (94 percent) for detecting key chronic TB pathologies.
arXiv Detail & Related papers (2025-03-17T13:49:29Z)
- EVOKE: Elevating Chest X-ray Report Generation via Multi-View Contrastive Learning and Patient-Specific Knowledge [21.596462896333733]
EVOKE is a novel chest X-ray report generation framework that incorporates multi-view contrastive learning and patient-specific knowledge. We present a knowledge-guided report generation module that integrates available patient-specific indications. Our proposed EVOKE surpasses recent state-of-the-art methods across multiple datasets.
arXiv Detail & Related papers (2024-11-15T14:38:13Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation [47.250147322130545]
Image-to-text radiology report generation aims to automatically produce radiology reports that describe the findings in medical images.
Most existing methods focus solely on the image data, disregarding the other patient information accessible to radiologists.
We present a novel multi-modal deep neural network framework for generating chest X-ray reports by integrating structured patient data, such as vital signs and symptoms, alongside unstructured clinical notes.
arXiv Detail & Related papers (2023-11-18T14:37:53Z)
- Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two language models, Show-Attend-Tell and GPT-3, to generate comprehensive and descriptive radiology records.
The proposed model is tested on two medical datasets, Open-I and MIMIC-CXR, as well as the general-purpose MS-COCO.
arXiv Detail & Related papers (2022-09-28T10:27:10Z)
- Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal Clinical Graph Transformer (CGT) for ophthalmic report generation (ORG).
CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure.
Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z)
- Factored Attention and Embedding for Unstructured-view Topic-related Ultrasound Report Generation [70.7778938191405]
We propose a novel factored attention and embedding model (termed FAE-Gen) for unstructured-view topic-related ultrasound report generation.
The proposed FAE-Gen mainly consists of two modules, i.e., view-guided factored attention and topic-oriented factored embedding, which capture the homogeneous and heterogeneous morphological characteristics across different views.
arXiv Detail & Related papers (2022-03-12T15:24:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.