DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation
- URL: http://arxiv.org/abs/2510.17038v1
- Date: Sun, 19 Oct 2025 22:59:32 GMT
- Title: DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation
- Authors: Pedram Fekri, Majid Roshanfar, Samuel Barbeau, Seyedfarzad Famouri, Thomas Looi, Dale Podolsky, Mehrdad Zadeh, Javad Dargahi
- Abstract summary: This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline.
- Score: 0.33727511459109777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature, requiring continuous physician input and lacking intelligent autonomy. This dependency contributes to operator fatigue, increased radiation exposure, and variability in procedural outcomes. This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Actions are predicted autoregressively from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline while additionally grounding predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation, representing an important step toward reducing operator dependency and improving the reliability of catheter-based therapies.
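The pipeline described in the abstract (fuse visual and kinematic inputs into a joint embedding, condition on a goal, predict actions autoregressively) can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, weights, and function names are assumptions, and random projections stand in for the DINO visual encoder and the learned kinematics and goal encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not specify these.
D_IMG, D_KIN, D_GOAL, D_EMB, D_ACT = 32, 4, 32, 64, 2

# Random projections stand in for learned encoders and the policy head.
W_img = rng.normal(size=(D_IMG, D_EMB)) * 0.1
W_kin = rng.normal(size=(D_KIN, D_EMB)) * 0.1
W_goal = rng.normal(size=(D_GOAL, D_EMB)) * 0.1
W_act = rng.normal(size=(3 * D_EMB + D_ACT, D_ACT)) * 0.1

def embed(x, W):
    """Project a raw feature vector into the shared embedding space."""
    return np.tanh(x @ W)

def predict_action(img_feat, kin_feat, goal_feat, prev_action):
    """Fuse visual, kinematic, and goal embeddings into a joint vector,
    then predict the next joystick action given the previous one."""
    joint = np.concatenate([
        embed(img_feat, W_img),    # vision-aware component
        embed(kin_feat, W_kin),    # kinematic-aware component
        embed(goal_feat, W_goal),  # goal conditioning
        prev_action,               # autoregressive input
    ])
    return np.tanh(joint @ W_act)

# Autoregressive rollout over a short horizon.
T = 5
action = np.zeros(D_ACT)
trajectory = []
for _ in range(T):
    img = rng.normal(size=D_IMG)    # stand-in image features
    kin = rng.normal(size=D_KIN)    # stand-in joystick kinematics
    goal = rng.normal(size=D_GOAL)  # stand-in target-destination embedding
    action = predict_action(img, kin, goal, action)
    trajectory.append(action)

print(len(trajectory), trajectory[0].shape)
```

In a trained system the projection matrices would be learned by behavior cloning on expert demonstrations; here they are random, so the rollout only illustrates the data flow, not a meaningful policy.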
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL) [57.65363326406228]
We propose a Hierarchical Modular Multi-Agent Reinforcement Learning framework for autonomous two-device navigation in vitro. HM-MARL was developed to autonomously navigate a guide catheter and guidewire from the femoral artery to the internal carotid artery (ICA). A modular multi-agent approach was used to decompose the complex navigation task into specialized subtasks, each trained using Soft Actor-Critic RL. In vitro, both HM-MARL models successfully navigated 100% of trials from the femoral artery to the right common carotid artery and 80% to the right ICA, but failed on the left-side vessel challenge.
arXiv Detail & Related papers (2026-02-20T23:50:35Z) - ReclAIm: A multi-agent framework for degradation-aware performance tuning of medical imaging AI [0.0]
ReclAIm is a multi-agent framework capable of autonomously monitoring, evaluating, and fine-tuning medical image classification models. It successfully trains, evaluates, and maintains consistent performance of models across MRI, CT, and X-ray datasets.
arXiv Detail & Related papers (2025-10-19T21:02:01Z) - Autonomous Soft Robotic Guidewire Navigation via Imitation Learning [3.1381624795986345]
In endovascular surgery, interventionists push a thin tube called a catheter, guided by a thin wire, to a treatment site inside the patient's blood vessels. Guidewires with robotic tips can enhance maneuverability, but they present challenges in modeling and control. We develop a transformer-based imitation learning framework with goal conditioning, relative action outputs, and automatic contrast dye injections.
arXiv Detail & Related papers (2025-10-10T15:57:09Z) - AR Surgical Navigation with Surface Tracing: Comparing In-Situ Visualization with Tool-Tracking Guidance for Neurosurgical Applications [0.0]
This study presents a novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation. The system registers target positions to the patient through a novel surface tracing method and uses real-time infrared tool tracking to aid in catheter placement.
arXiv Detail & Related papers (2025-08-14T11:46:30Z) - Guidance for Intra-cardiac Echocardiography Manipulation to Maintain Continuous Therapy Device Tip Visibility [7.208458407211938]
Intra-cardiac Echocardiography (ICE) plays a critical role in Electrophysiology (EP) and Structural Heart Disease (SHD) interventions. Maintaining continuous visibility of the therapy device tip remains a challenge due to frequent adjustments required during manual ICE catheter manipulation. We propose an AI-driven tracking model that estimates the device tip incident angle and passing point within the ICE imaging plane.
arXiv Detail & Related papers (2025-05-08T02:48:30Z) - AI-driven View Guidance System in Intra-cardiac Echocardiography Imaging [7.074445406436684]
Intra-cardiac echocardiography (ICE) is a crucial imaging modality used in electrophysiology (EP) and structural heart disease (SHD) interventions. We propose an AI-driven view guidance system that operates in a continuous closed-loop with human-in-the-loop feedback.
arXiv Detail & Related papers (2024-09-25T13:08:10Z) - Causal Graph ODE: Continuous Treatment Effect Modeling in Multi-agent Dynamical Systems [70.84976977950075]
Real-world multi-agent systems are often dynamic and continuous, where the agents co-evolve and undergo changes in their trajectories and interactions over time.
We propose a novel model that captures the continuous interaction among agents using a Graph Neural Network (GNN) as the ODE function.
The key innovation of our model is to learn time-dependent representations of treatments and incorporate them into the ODE function, enabling precise predictions of potential outcomes.
arXiv Detail & Related papers (2024-02-29T23:07:07Z) - Robotic Navigation Autonomy for Subretinal Injection via Intelligent Real-Time Virtual iOCT Volume Slicing [88.99939660183881]
We propose a framework for autonomous robotic navigation for subretinal injection.
Our method consists of an instrument pose estimation method, an online registration between the robotic and the iOCT system, and trajectory planning tailored for navigation to an injection target.
Our experiments on ex-vivo porcine eyes demonstrate the precision and repeatability of the method.
arXiv Detail & Related papers (2023-01-17T21:41:21Z) - Towards Autonomous Atlas-based Ultrasound Acquisitions in Presence of Articulated Motion [48.52403516006036]
This paper proposes a vision-based approach allowing autonomous robotic US limb scanning.
To this end, an atlas MRI template of a human arm with annotated vascular structures is used to generate trajectories.
In all cases, the system can successfully acquire the planned vascular structure on volunteers' limbs.
arXiv Detail & Related papers (2022-08-10T15:39:20Z) - ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [55.485985317538194]
ProcTHOR is a framework for procedural generation of Embodied AI environments.
We demonstrate state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
arXiv Detail & Related papers (2022-06-14T17:09:35Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations in multi-modal data for gesture recognition.
Results show that our approach recovers the performance with great improvement gains, up to 12.91% in ACC and 20.16% in F1-score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z) - Online Body Schema Adaptation through Cost-Sensitive Active Learning [63.84207660737483]
The work was implemented in a simulation environment, using the 7DoF arm of the iCub robot simulator.
A cost-sensitive active learning approach is used to select optimal joint configurations.
The results show that cost-sensitive active learning achieves accuracy similar to the standard active learning approach while reducing the executed movement by about half.
arXiv Detail & Related papers (2021-01-26T16:01:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.