EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
- URL: http://arxiv.org/abs/2505.15206v1
- Date: Wed, 21 May 2025 07:35:00 GMT
- Title: EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
- Authors: Chi Kit Ng, Long Bai, Guankun Wang, Yupeng Wang, Huxin Gao, Kun Yuan, Chenhan Jin, Tieyong Zeng, Hongliang Ren
- Abstract summary: Vision-Language-Action (VLA) models integrate visual perception, language grounding, and motion planning within an end-to-end framework. EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting.
- Score: 26.132684811981143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In endoscopic procedures, autonomous tracking of abnormal regions and following circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile: each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, leading to poor generalization across diverse scenes. Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by semantically adapting to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To address this, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting. To tackle data scarcity and domain shifts, we propose a dual-phase strategy comprising supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning with task-aware rewards. Our approach significantly improves tracking performance in endoscopy and enables zero-shot generalization in diverse scenes and complex sequential tasks.
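As a toy illustration of the dual-phase strategy described in the abstract, the sketch below runs a supervised phase on demonstrated actions and then a reward-guided refinement phase. The model, demonstrations, and reward are invented stand-ins, not the EndoVLA implementation:

```python
import random

def supervised_update(params, demo, lr=0.1):
    # Phase 1 (SFT): move parameters toward the demonstrated actions.
    return [p + lr * (a - p) for p, a in zip(params, demo)]

def task_aware_reward(action, target):
    # Toy task-aware reward: higher when the action stays near the target.
    return -abs(action - target)

def reinforcement_update(params, target, lr=0.1, sigma=0.5, samples=8):
    # Phase 2 (RFT): sample perturbed actions and move toward the
    # highest-reward candidate for each parameter.
    new = []
    for p in params:
        cands = [p + random.gauss(0, sigma) for _ in range(samples)]
        best = max(cands, key=lambda a: task_aware_reward(a, target))
        new.append(p + lr * (best - p))
    return new

params = [0.0, 0.0]
params = supervised_update(params, demo=[1.0, -1.0])  # fine-tune on demos
params = reinforcement_update(params, target=1.0)     # refine with reward
```

In the paper the two phases are applied to a full VLA policy; here they are collapsed to scalar parameters purely to show the ordering of the phases.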
Related papers
- EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control [10.426745597034204]
We introduce EndoControlMag, a training-free framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation, and a Hierarchical Tissue-aware Dual-Mask Control. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios.
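A minimal sketch of how a periodic reference resetting scheme might slice a video into short overlapping clips, with the first frame of each clip acting as a freshly reset reference (parameter values are illustrative, not the paper's settings):

```python
def periodic_reference_clips(num_frames, clip_len=8, overlap=2):
    """Split a video into short overlapping clips; the first frame of each
    clip serves as a reset reference, limiting error accumulation."""
    clips = []
    stride = clip_len - overlap
    start = 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        clips.append({"reference": start, "frames": list(range(start, end))})
        if end == num_frames:
            break
        start += stride
    return clips
```

For a 20-frame video with `clip_len=8` and `overlap=2`, this yields clips starting at frames 0, 6, and 12, each carrying its own reference frame.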
arXiv Detail & Related papers (2025-07-21T06:47:44Z) - EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery [11.286605039002419]
Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. We propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopic surgery activity recognition and semantic segmentation.
arXiv Detail & Related papers (2025-06-07T15:18:43Z) - Landmark-Free Preoperative-to-Intraoperative Registration in Laparoscopic Liver Resection [50.388465935739376]
Liver registration by overlaying preoperative 3D models onto intraoperative 2D frames can assist surgeons in clearly perceiving the spatial anatomy of the liver for a higher surgical success rate. Existing registration methods rely heavily on anatomical landmarks and encounter two major limitations. We propose a landmark-free preoperative-to-intraoperative registration framework utilizing effective self-supervised learning.
arXiv Detail & Related papers (2025-04-21T14:55:57Z) - EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance [79.66329903007869]
We present EchoWorld, a motion-aware world modeling framework for probe guidance. It encodes anatomical knowledge and motion-induced visual dynamics, and is trained on more than one million ultrasound images from over 200 routine scans.
arXiv Detail & Related papers (2025-04-17T16:19:05Z) - Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision [3.290418382279656]
Endo-TTAP is a novel framework for tissue point tracking in endoscopic videos. The MFGA module synergizes multi-scale flow dynamics, DINOv2 semantic embeddings, and explicit motion patterns to jointly predict point positions. Stage I utilizes synthetic data with optical-flow ground truth for uncertainty-occlusion regularization. Stage II combines unsupervised flow consistency and semi-supervised learning with refined pseudo-labels from off-the-shelf trackers.
arXiv Detail & Related papers (2025-03-28T13:00:07Z) - Multi-Scale Feature Fusion with Image-Driven Spatial Integration for Left Atrium Segmentation from Cardiac MRI Images [0.0]
We propose a framework that integrates DINOv2 as an encoder with a UNet-style decoder, incorporating multi-scale feature fusion and input image integration. We validate our approach on the LAScarQS 2022 dataset and demonstrate improved performance, with a 92.3% Dice and 84.1% IoU score for the giant architecture.
arXiv Detail & Related papers (2025-02-10T16:12:46Z) - Multi-Layer Gaussian Splatting for Immersive Anatomy Visualization [1.0580610673031074]
In medical image visualization, path tracing of volumetric medical data like CT scans produces lifelike visualizations.
We propose a novel approach utilizing GS to create an efficient but static intermediate representation of CT scans.
Our approach achieves interactive frame rates while preserving anatomical structures, with quality adjustable to the target hardware.
arXiv Detail & Related papers (2024-10-22T12:56:58Z) - Efficient Multi-View Fusion and Flexible Adaptation to View Missing in Cardiovascular System Signals [4.519437028632205]
Deep learning has facilitated automatic multi-view fusion (MVF) about the cardiovascular system (CVS) signals.
MVF model architecture often amalgamates CVS signals from the same temporal step but different views into a unified representation.
We introduce prompt techniques to aid pretrained MVF models in flexibly adapting to various missing-view scenarios.
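One way such prompt techniques can be sketched: a learned prompt vector substitutes for each missing view before fusion, so the fused representation stays well-defined regardless of which views are absent. The averaging fusion and the names below are illustrative, not the paper's architecture:

```python
def fuse_views(views, prompts):
    # views: per-view embeddings, with None marking a missing view.
    # prompts: learned prompt vectors, one per view, used as substitutes.
    filled = [v if v is not None else prompts[i] for i, v in enumerate(views)]
    dim = len(filled[0])
    # Toy fusion: element-wise average across all (real or prompted) views.
    return [sum(v[d] for v in filled) / len(filled) for d in range(dim)]
```

With view 1 missing, `fuse_views([[1.0, 2.0], None], [[0.0, 0.0], [3.0, 4.0]])` averages the real embedding with the second prompt vector.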
arXiv Detail & Related papers (2024-06-13T08:58:59Z) - Real-time guidewire tracking and segmentation in intraoperative x-ray [52.51797358201872]
We propose a two-stage deep learning framework for real-time guidewire segmentation and tracking.
In the first stage, a Yolov5 detector is trained, using the original X-ray images as well as synthetic ones, to output the bounding boxes of possible target guidewires.
In the second stage, a novel and efficient network is proposed to segment the guidewire in each detected bounding box.
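The two-stage detect-then-segment pipeline can be sketched as follows; the detector and segmenter here are trivial placeholders standing in for the trained YOLOv5 and segmentation networks, and the box values are invented:

```python
def detect_boxes(image):
    # Stand-in for the first-stage detector: returns candidate guidewire
    # bounding boxes as (x, y, w, h). A real detector would run YOLOv5.
    return [(10, 12, 40, 8)]

def segment_in_box(image, box):
    # Stand-in for the second-stage network: returns a per-box mask,
    # here simply every pixel coordinate inside the crop.
    x, y, w, h = box
    return {(i, j) for i in range(x, x + w) for j in range(y, y + h)}

def track_guidewire(image):
    # Two-stage pipeline: detect first, then segment only inside each
    # detected box, keeping per-frame cost low enough for real-time use.
    return [segment_in_box(image, b) for b in detect_boxes(image)]
```

The design choice being illustrated is that segmentation operates on small crops rather than the full X-ray frame, which is what makes the second stage cheap.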
arXiv Detail & Related papers (2024-04-12T20:39:19Z) - CathFlow: Self-Supervised Segmentation of Catheters in Interventional Ultrasound Using Optical Flow and Transformers [66.15847237150909]
We introduce a self-supervised deep learning architecture to segment catheters in longitudinal ultrasound images.
The network architecture builds upon AiAReSeg, a segmentation transformer built with the Attention in Attention mechanism.
We validated our model on a test dataset, consisting of unseen synthetic data and images collected from silicon aorta phantoms.
arXiv Detail & Related papers (2024-03-21T15:13:36Z) - Inflated 3D Convolution-Transformer for Weakly-supervised Carotid Stenosis Grading with Ultrasound Videos [12.780908780402516]
We present the first video classification framework for automatic carotid stenosis grading (CSG).
We propose a novel and effective video classification network for weakly-supervised CSG.
Our approach is extensively validated on a large clinically collected carotid US video dataset.
arXiv Detail & Related papers (2023-06-05T02:50:06Z) - Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Graph Scene (MSSG) which aims at providing unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z) - Robust Medical Instrument Segmentation Challenge 2019 [56.148440125599905]
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions.
Our challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures.
The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap.
arXiv Detail & Related papers (2020-03-23T14:35:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.