Related papers: One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video

URL: http://arxiv.org/abs/2103.12988v1
Date: Wed, 24 Mar 2021 05:02:18 GMT
Title: One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video
Authors: Zixu Zhao, Yueming Jin, Bo Lu, Chi-Fai Ng, Qi Dou, Yun-Hui Liu, and Pheng-Ann Heng
Abstract summary: MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery. It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm. It outperforms other state-of-the-art methods on two datasets.
Score: 71.43912903508765
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Surgical instrument segmentation in robot-assisted surgery (RAS) - especially that using learning-based models - relies on the assumption that training and testing videos are sampled from the same domain. However, it is impractical and expensive to collect and annotate sufficient data from every new domain. To greatly increase the label efficiency, we explore a new problem, i.e., adaptive instrument segmentation, which is to effectively adapt one source model to new robotic surgical videos from multiple target domains, only given the annotated instruments in the first frame. We propose MDAL, a meta-learning based dynamic online adaptive learning scheme with a two-stage framework to fast adapt the model parameters on the first frame and partial subsequent frames while predicting the results. MDAL learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm. The added gradient gate excludes the noisy supervision from pseudo masks for dynamic online adaptation on target videos. We demonstrate empirically that MDAL outperforms other state-of-the-art methods on two datasets (including a real-world RAS dataset). The promising performance on ex-vivo scenes also benefits the downstream tasks such as robot-assisted suturing and camera control.

Related papers

AMPLIFY: Actionless Motion Priors for Robot Learning from Videos [29.799207502031496]
We introduce AMPLIFY, a novel framework that leverages large-scale video data. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data.
arXiv Detail & Related papers (2025-06-17T05:31:42Z)
Attribute-Based Robotic Grasping with Data-Efficient Adaptation [19.683833436076313]
We present an end-to-end encoder-decoder network to learn attribute-based robotic grasping. Our approach achieves over 81% instance grasping success rate on unknown objects.
arXiv Detail & Related papers (2025-01-04T00:37:17Z)
VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency. Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame. ATM outperforms strong video pre-training baselines by 80% on average. We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
XVO: Generalized Visual Odometry via Cross-Modal Self-Training [11.70220331540621]
XVO is a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models. In contrast to standard monocular VO approaches which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube.
arXiv Detail & Related papers (2023-09-28T18:09:40Z)
Domain Adaptive Sim-to-Real Segmentation of Oropharyngeal Organs Towards Robot-assisted Intubation [15.795665057836636]
This work introduces a virtual dataset generated by the Open Framework Architecture framework to overcome the limited availability of actual endoscopic images. We also propose a domain adaptive Sim-to-Real method for oropharyngeal organ image segmentation, which employs an image blending strategy. Experimental results demonstrate the superior performance of the proposed approach with domain adaptive models.
arXiv Detail & Related papers (2023-05-19T14:08:15Z)
ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays. We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement. In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
arXiv Detail & Related papers (2022-02-23T11:11:54Z)
EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content. First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot. It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations across modalities for gesture recognition. Results show that our approach recovers performance with large gains, up to 12.91% in ACC and 20.16% in F1 score, without using any annotations from the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.