RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze
- URL: http://arxiv.org/abs/2507.09097v1
- Date: Sat, 12 Jul 2025 00:45:38 GMT
- Title: RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze
- Authors: Yunsoo Kim, Jinge Wu, Honghan Wu
- Abstract summary: RadEyeVideo integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task.
- Score: 2.4302611783073145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists' eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method on CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task and by 15.2% on average across both tasks under scaled evaluation metrics. Notably, RadEyeVideo enabled an open-domain LVLM, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent that were trained on large chest X-ray datasets. This work highlights that domain experts' knowledge (here, eye-gaze information), when effectively integrated with LVLMs, can significantly enhance general-domain models' capabilities in clinical tasks. RadEyeVideo is a step toward a scalable, human-centered approach to utilizing LVLMs in medical image analytics.
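The core idea, rendering the ordered fixation sequence as a short clip, can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released pipeline: the fixation format (x pixel, y pixel, dwell time in ms), the OpenCV rendering, and the marker and fps choices are hypothetical.

```python
import cv2  # pip install opencv-python

def fixations_to_video(cxr_path, fixations, out_path="gaze.mp4", fps=4):
    """Render one frame per fixation so the clip keeps both where and
    in what order the radiologist looked. `fixations` is an assumed
    list of (x_px, y_px, dwell_ms) integer triples in temporal order."""
    img = cv2.imread(cxr_path)  # chest X-ray as a BGR image
    h, w = img.shape[:2]
    writer = cv2.VideoWriter(
        out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h)
    )
    trail = []  # fixations already visited, drawn as a faint trail
    for x, y, dur_ms in fixations:
        frame = img.copy()
        for px, py in trail:
            cv2.circle(frame, (px, py), 8, (160, 160, 160), 1)
        radius = max(10, dur_ms // 20)  # longer dwell -> larger marker
        cv2.circle(frame, (x, y), radius, (0, 0, 255), 2)
        writer.write(frame)
        trail.append((x, y))
    writer.release()
    return out_path

# Hypothetical usage with made-up coordinates and dwell times:
# clip = fixations_to_video("cxr.png", [(250, 180, 400), (310, 220, 650)])
```

A clip built this way preserves the scan path that a static heatmap or textual prompt discards; it would then be supplied together with the CXR when prompting a video-capable LVLM such as LLaVA-OneVision.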
Related papers
- ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays [1.9827390755712084]
Vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. We present ChestGPT, a framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76.
arXiv Detail & Related papers (2025-07-04T17:58:52Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework performing multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- X-GRM: Large Gaussian Reconstruction Model for Sparse-view X-rays to Computed Tomography [89.84588038174721]
Computed Tomography serves as an indispensable tool in clinical practice, providing non-invasive visualization of internal anatomical structures. Existing CT reconstruction works are limited to small-capacity model architectures and inflexible volume representations. We present X-GRM, a large feedforward model for reconstructing 3D CT volumes from sparse-view 2D X-ray projections.
arXiv Detail & Related papers (2025-05-21T08:14:10Z)
- Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation [21.772106685777995]
We introduce a radiology-focused visual language model designed to generate radiology reports from chest X-rays. Our model combines an image encoder with a fine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate different sections of a radiology report with notable accuracy.
arXiv Detail & Related papers (2024-12-06T11:14:03Z)
- Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns [7.6599164274971026]
This work enhances Vision-Language Models (VLMs) with radiologists' attention by incorporating eye gaze data alongside textual prompts. Heatmaps generated from the eye gaze data are overlaid onto medical images to highlight the areas of most intense radiologist focus (see the sketch after this entry). Results demonstrate that including eye gaze information significantly enhances the accuracy of chest X-ray analysis.
arXiv Detail & Related papers (2024-04-03T00:09:05Z)
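For contrast with the video representation above, here is a minimal sketch of the heatmap-style encoding this entry describes: fixations are splatted into a 2D map, blurred, and alpha-blended over the image. The (x, y, dwell) fixation format and all parameters are illustrative assumptions, not the paper's implementation.

```python
import cv2
import numpy as np

def gaze_heatmap_overlay(cxr_path, fixations, sigma=25, alpha=0.4):
    """Splat dwell-time-weighted fixations, blur, and blend over the CXR."""
    img = cv2.imread(cxr_path)
    heat = np.zeros(img.shape[:2], dtype=np.float32)
    for x, y, dur_ms in fixations:
        heat[y, x] += dur_ms                      # weight by dwell time
    heat = cv2.GaussianBlur(heat, (0, 0), sigma)  # diffuse each fixation
    heat = (255 * heat / (heat.max() + 1e-8)).astype(np.uint8)
    color = cv2.applyColorMap(heat, cv2.COLORMAP_JET)
    return cv2.addWeighted(color, alpha, img, 1 - alpha, 0)
```

Because the summation is order-invariant, two opposite scan paths produce the same heatmap; this is exactly the temporal information that RadEyeVideo's video representation retains.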
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Endora: Video Generation Models as Endoscopy Simulators [53.72175969751398]
This paper introduces Endora, an innovative approach to generating medical videos that simulate clinical endoscopy scenes.
We also pioneer the first public benchmark for endoscopy simulation with video generation models.
Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research.
arXiv Detail & Related papers (2024-03-17T00:51:59Z)
- Intensive Vision-guided Network for Radiology Report Generation [22.030289124516326]
We propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception.
We also explore how to involve multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content in next word prediction.
arXiv Detail & Related papers (2024-02-06T06:46:46Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [72.8965643836841]
We introduce XrayGPT, a novel conversational medical vision-language model. It can analyze and answer open-ended questions about chest radiographs. We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose MRG-Net, a novel online multi-modal graph network that dynamically integrates visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)