Visuomotor Grasping with World Models for Surgical Robots
- URL: http://arxiv.org/abs/2508.11200v1
- Date: Fri, 15 Aug 2025 04:23:07 GMT
- Title: Visuomotor Grasping with World Models for Surgical Robots
- Authors: Hongbin Lin, Bin Li, Kwok Wai Samuel Au
- Abstract summary: We introduce Grasp Anything for Surgery V2 (GASv2), a visuomotor learning framework for surgical grasping. We train the policy in simulation using domain randomization for sim-to-real transfer and deploy it on a real robot in both phantom-based and ex vivo surgical settings. Experiments show our policy achieves a 65% success rate in both settings, generalizes to unseen objects and grippers, and adapts to diverse disturbances.
- Score: 6.228255257808355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grasping is a fundamental task in robot-assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and the ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents unique challenges, such as the low signal-to-noise ratio in visual observations, demands for high safety and millimeter-level precision, and the complexity of the surgical environment. This paper addresses three key challenges: (i) sim-to-real transfer of visuomotor policies to ex vivo surgical scenes, (ii) visuomotor learning using only a single stereo camera pair -- the standard RAS setup, and (iii) object-agnostic grasping with a single policy that generalizes to diverse, unseen surgical objects without retraining or task-specific models. We introduce Grasp Anything for Surgery V2 (GASv2), a visuomotor learning framework for surgical grasping. GASv2 leverages a world-model-based architecture and a surgical perception pipeline for visual observations, combined with a hybrid control system for safe execution. We train the policy in simulation using domain randomization for sim-to-real transfer and deploy it on a real robot in both phantom-based and ex vivo surgical settings, using only a single pair of endoscopic cameras. Extensive experiments show our policy achieves a 65% success rate in both settings, generalizes to unseen objects and grippers, and adapts to diverse disturbances, demonstrating strong performance, generality, and robustness.
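To make the domain-randomization idea concrete, below is a minimal sketch of the kind of per-episode parameter sampling the abstract describes. The parameter names and ranges here are illustrative assumptions, not GASv2's actual configuration, and the simulation hooks are hypothetical.

```python
# Minimal domain-randomization sketch (illustrative; parameter names and
# ranges are assumptions, not GASv2's actual configuration).
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    light_intensity: float   # relative brightness of the endoscopic light
    cam_jitter_mm: float     # random camera-pose perturbation, in millimeters
    tissue_rgb: tuple        # randomized background tissue color
    friction: float          # gripper-object friction coefficient

def sample_randomized_params(rng: random.Random) -> SimParams:
    """Draw a fresh set of simulation parameters for each training episode,
    so the policy never overfits to one rendering/dynamics configuration."""
    return SimParams(
        light_intensity=rng.uniform(0.5, 1.5),
        cam_jitter_mm=rng.uniform(0.0, 2.0),
        tissue_rgb=tuple(rng.uniform(0.3, 0.9) for _ in range(3)),
        friction=rng.uniform(0.4, 1.2),
    )

rng = random.Random(0)
for episode in range(3):
    params = sample_randomized_params(rng)
    # reset_simulation(params); run_episode(policy)  # hypothetical hooks
    print(episode, params)
```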
Related papers
- MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts [1.6646268910871171]
We present a supervised Mixture-of-Experts architecture designed for phase-structured surgical manipulation tasks. We show that a lightweight action-decoder policy can learn complex, long-horizon manipulation from fewer than 150 demonstrations. We also present preliminary results of policy roll-outs during in vivo porcine surgery.
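A hedged sketch of the supervised mixture-of-experts pattern the summary describes: a gate is trained against ground-truth surgical-phase labels and routes features to per-phase experts. The layer sizes, dimensions, and module structure are illustrative assumptions, not the paper's exact architecture.

```python
# Supervised MoE action decoder: the gate is a phase classifier trained on
# labeled phases, unlike the unsupervised routing of classic MoE layers.
import torch
import torch.nn as nn

class SupervisedMoEDecoder(nn.Module):
    def __init__(self, feat_dim=256, act_dim=7, n_phases=3):
        super().__init__()
        self.gate = nn.Linear(feat_dim, n_phases)   # predicts surgical phase
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                          nn.Linear(128, act_dim))
            for _ in range(n_phases)
        )

    def forward(self, feats, phase_labels=None):
        logits = self.gate(feats)                        # (B, n_phases)
        weights = torch.softmax(logits, dim=-1)
        actions = torch.stack([e(feats) for e in self.experts], dim=1)
        action = (weights.unsqueeze(-1) * actions).sum(dim=1)  # gated mix
        gate_loss = (nn.functional.cross_entropy(logits, phase_labels)
                     if phase_labels is not None else None)
        return action, gate_loss
```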
arXiv Detail & Related papers (2026-01-29T16:50:14Z)
- Evaluating Gemini Robotics Policies in a Veo World Simulator [69.23071832313246]
We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints across five tasks for a bimanual manipulator.
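A hedged sketch of the generative-evaluation pattern: the policy is rolled out inside a learned video model instead of the real world. The stubs below are hypothetical stand-ins, not the Veo API or the paper's actual system.

```python
# Policy-in-the-loop evaluation inside an action-conditioned video model.
# All functions here are illustrative stubs.
def video_model_step(frames, action):
    """Stub for one action-conditioned video prediction step."""
    return frames + [f"frame|{action}"]

def policy(frames):
    return "move_left"                  # stand-in policy

def check_success(frame):
    """Stub for a learned success detector scoring the final frame."""
    return "success" in frame

def evaluate(n_steps=10):
    frames = ["initial_frame"]
    for _ in range(n_steps):
        frames = video_model_step(frames, policy(frames))
    return check_success(frames[-1])

print(evaluate())
```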
arXiv Detail & Related papers (2025-12-11T14:22:14Z)
- Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer [59.02729900344616]
GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning. We develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.
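A hedged sketch of the core teacher-student idea behind such frameworks: a teacher with privileged simulator state supervises an RGB-only student. The networks, shapes, and training loop below are illustrative assumptions, not the paper's implementation.

```python
# Distilling a privileged-state teacher into a pixel-based student policy.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 12))
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256),
                        nn.ReLU(), nn.Linear(256, 12))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for step in range(100):
    state = torch.randn(8, 32)        # privileged sim state seen by teacher
    rgb = torch.randn(8, 3, 64, 64)   # paired rendered images seen by student
    with torch.no_grad():
        target_action = teacher(state)   # teacher acts from full state
    loss = nn.functional.mse_loss(student(rgb), target_action)
    opt.zero_grad()
    loss.backward()
    opt.step()
```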
arXiv Detail & Related papers (2025-11-30T20:07:13Z)
- SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement [8.337819078911405]
SurgVisAgent is an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). It dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks. We construct a benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models.
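A hedged sketch of the agentic dispatch pattern the summary describes: classify the distortion in a frame, then route it to a matching enhancement module. The category names, threshold, and modules are illustrative stand-ins, not SurgVisAgent's actual components.

```python
# Classify-then-dispatch enhancement, with stand-in modules.
def denoise(img): return img          # placeholder enhancement modules
def deblur(img): return img
def relight(img): return img

ENHANCERS = {"noise": denoise, "blur": deblur, "low_light": relight}

def classify_distortion(img):
    """Stand-in for the MLLM that predicts distortion category and severity."""
    return "blur", 0.7                # (category, severity) - dummy output

def enhance(img):
    category, severity = classify_distortion(img)
    # Only invoke an enhancer when the distortion is severe enough.
    return ENHANCERS[category](img) if severity > 0.5 else img
```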
arXiv Detail & Related papers (2025-07-03T03:00:26Z)
- EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy [26.132684811981143]
Vision-Language-Action (VLA) models integrate visual perception, language grounding, and motion planning within an end-to-end framework. EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting.
arXiv Detail & Related papers (2025-05-21T07:35:00Z)
- Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids [56.892520712892804]
We introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three dexterous manipulation tasks. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors.
arXiv Detail & Related papers (2025-02-27T18:59:52Z)
- AMNCutter: Affinity-Attention-Guided Multi-View Normalized Cutter for Unsupervised Surgical Instrument Segmentation [7.594796294925481]
We propose a label-free unsupervised model featuring a novel module named Multi-View Normalized Cutter (m-NCutter).
Our model is trained using a graph-cutting loss function that leverages patch affinities for supervision, eliminating the need for pseudo-labels.
We conduct comprehensive experiments across multiple surgical instrument segmentation (SIS) datasets to validate our approach's state-of-the-art (SOTA) performance, robustness, and exceptional potential as a pre-trained model.
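A hedged sketch of a graph-cutting loss over patch affinities: the soft normalized-cut relaxation below is a generic instance of this family of objectives, not necessarily AMNCutter's exact loss.

```python
# Soft normalized-cut loss: minimized when clusters are internally
# well-connected and mutually separated, with no pseudo-labels needed.
import torch

def soft_ncut_loss(affinity: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    """affinity: (N, N) nonnegative patch-affinity matrix W.
    assign: (N, K) soft cluster assignments (rows sum to 1).
    Returns K - sum_k assoc(A_k, A_k) / assoc(A_k, V)."""
    degree = affinity.sum(dim=1)                       # d_i = sum_j W_ij
    assoc_within = torch.einsum("nk,nm,mk->k", assign, affinity, assign)
    assoc_total = torch.einsum("nk,n->k", assign, degree) + 1e-8
    return assign.shape[1] - (assoc_within / assoc_total).sum()

W = torch.rand(16, 16)
W = (W + W.T) / 2                                      # symmetric affinities
S = torch.softmax(torch.randn(16, 4), dim=-1)          # soft assignments
print(soft_ncut_loss(W, S))
```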
arXiv Detail & Related papers (2024-11-06T06:33:55Z)
- World Models for General Surgical Grasping [7.884835348797252]
We propose a world-model-based deep reinforcement learning framework, "Grasp Anything for Surgery" (GAS).
We learn a pixel-level visuomotor policy for surgical grasping, enhancing both generality and robustness.
Our system also demonstrates significant robustness across six conditions: background variation, target disturbance, camera pose variation, kinematic control error, image noise, and re-grasping after the gripped target object drops from the gripper.
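A hedged sketch of the world-model idea behind GAS (and GASv2 above): learn latent dynamics from pixels, then train the policy on imagined latent rollouts, in the style of Dreamer-like agents. All modules, sizes, and the one-step training below are illustrative assumptions.

```python
# Imagined latent rollout: the policy is optimized against a learned
# dynamics and reward model, without stepping the simulator.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # obs -> z
dynamics = nn.Linear(128 + 7, 128)    # (z, action) -> next latent z
reward_head = nn.Linear(128, 1)       # predicted reward from latent
policy = nn.Linear(128, 7)            # latent -> action

obs = torch.randn(4, 3, 64, 64)       # dummy pixel observations
z = encoder(obs)
imagined_return = 0.0
for t in range(5):                     # imagine ahead inside the world model
    a = torch.tanh(policy(z))
    z = torch.tanh(dynamics(torch.cat([z, a], dim=-1)))
    imagined_return = imagined_return + reward_head(z).mean()
(-imagined_return).backward()          # ascend the imagined return
```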
arXiv Detail & Related papers (2024-05-28T08:11:12Z)
- Robotic Constrained Imitation Learning for the Peg Transfer Task in Fundamentals of Laparoscopic Surgery [18.64205729932939]
We present an implementation strategy for a robot that performs peg transfer tasks in Fundamentals of Laparoscopic Surgery (FLS) via imitation learning.
In this study, we achieve more accurate imitation learning with only monocular images.
We implemented an overall system using two Franka Emika Panda Robot Arms and validated its effectiveness.
arXiv Detail & Related papers (2024-05-06T13:12:25Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
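A hedged sketch of the summary's core idea: use a foundation-model segmentation mask to localize an object for manipulation. Here the mask is combined with depth to back-project a 3D centroid; the camera intrinsics and mask source are illustrative assumptions.

```python
# From a text-promptable segmentation mask + depth to a 3D object centroid.
import numpy as np

def mask_to_3d_centroid(mask, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    """mask: (H, W) bool from a promptable segmenter; depth: (H, W) meters."""
    v, u = np.nonzero(mask)                 # pixel coordinates inside mask
    z = depth[v, u]
    x = (u - cx) * z / fx                   # back-project (pinhole model)
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).mean(axis=0)

mask = np.zeros((480, 640), bool)
mask[200:220, 300:330] = True               # dummy object mask
depth = np.full((480, 640), 0.5)            # dummy depth map
print(mask_to_3d_centroid(mask, depth))     # approximate object position (m)
```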
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments [64.59698930334012]
First, we present a multi-camera capture setup consisting of static and head-mounted cameras. Second, we publish a multi-view RGB-D video dataset of ex vivo spine surgeries, captured in a surgical wet lab and a real operating theatre. Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z)
- Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims to provide a unified symbolic and semantic representation of surgical procedures.
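A hedged sketch of a semantic scene-graph representation for a surgical scene: typed nodes (instruments, anatomy, staff) and labeled relations. The schema is an illustrative assumption, not the paper's exact MSSG definition.

```python
# Minimal scene graph: typed nodes plus (source, relation, target) edges.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    node_type: str                       # e.g. "instrument", "anatomy"

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (src, relation, dst)

    def add(self, node: Node):
        self.nodes[node.name] = node

    def relate(self, src: str, relation: str, dst: str):
        self.edges.append((src, relation, dst))

g = SceneGraph()
g.add(Node("forceps", "instrument"))
g.add(Node("liver", "anatomy"))
g.relate("forceps", "grasping", "liver")
print(g.edges)
```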
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.