CaRTS: Causality-driven Robot Tool Segmentation from Vision and
Kinematics Data
- URL: http://arxiv.org/abs/2203.09475v1
- Date: Tue, 15 Mar 2022 22:26:19 GMT
- Title: CaRTS: Causality-driven Robot Tool Segmentation from Vision and
Kinematics Data
- Authors: Hao Ding, Jintan Zhang, Peter Kazanzides, Jieying Wu, and Mathias
Unberath
- Abstract summary: Vision-based segmentation of the robotic tool during robot-assisted surgery enables downstream applications, such as augmented reality feedback.
With the introduction of deep learning, many methods have been presented that solve instrument segmentation directly and solely from images.
We present CaRTS, a causality-driven robot tool segmentation algorithm based on a complementary causal model of the robot tool segmentation task.
- Score: 11.92904350972493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-based segmentation of the robotic tool during robot-assisted surgery
enables downstream applications, such as augmented reality feedback, while
allowing for inaccuracies in robot kinematics. With the introduction of deep
learning, many methods have been presented that solve instrument segmentation
directly and solely from images. While these approaches have made remarkable
progress on
benchmark datasets, fundamental challenges pertaining to their robustness
remain. We present CaRTS, a causality-driven robot tool segmentation algorithm
based on a complementary causal model of the robot tool segmentation task.
Rather than directly inferring segmentation masks from
observed images, CaRTS iteratively aligns tool models with image observations
by updating the initially incorrect robot kinematic parameters through forward
kinematics and differentiable rendering to optimize image feature similarity
end-to-end. We benchmark CaRTS with competing techniques on both synthetic as
well as real data from the dVRK, generated in precisely controlled scenarios to
allow for counterfactual synthesis. On training-domain test data, CaRTS
achieves a Dice score of 93.4, which is largely preserved (Dice score of 91.8)
when tested on counterfactually altered test data exhibiting low brightness, smoke,
blood, and altered background patterns. This compares favorably to Dice scores
of 95.0 and 62.8, respectively, of a purely image-based method trained and
tested on the same data. Future work will involve accelerating CaRTS to achieve
video frame rate and estimating the practical impact of occlusion. Despite
these limitations, our results are promising: in addition to achieving high
segmentation accuracy, CaRTS provides estimates of the true robot kinematics,
which may benefit applications such as force estimation.
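The iterative alignment described above lends itself to a compact illustration. The following is a minimal toy sketch of that loop in PyTorch, not the authors' implementation: the planar two-link forward kinematics, the Gaussian-blob "renderer", and the soft-Dice similarity are simplified stand-ins for the paper's dVRK kinematics, differentiable rendering, and image-feature comparison, chosen only so the example runs self-contained.

```python
# Minimal toy sketch (not the authors' code) of the CaRTS idea: treat the robot
# kinematics as the unknown, render the tool under the current estimate with a
# differentiable renderer, compare the rendering to the observed image, and
# back-propagate to correct the kinematic parameters end-to-end.
import torch

def forward_kinematics(q: torch.Tensor) -> torch.Tensor:
    """Toy planar two-link arm; returns the tool tip (x, y) in pixel coordinates."""
    l1, l2 = 30.0, 20.0                               # assumed link lengths (pixels)
    x = 64 + l1 * torch.cos(q[0]) + l2 * torch.cos(q[0] + q[1])
    y = 64 + l1 * torch.sin(q[0]) + l2 * torch.sin(q[0] + q[1])
    return torch.stack([x, y])

def render_soft_mask(tip: torch.Tensor, size: int = 128, sigma: float = 6.0) -> torch.Tensor:
    """Differentiable stand-in renderer: a Gaussian blob centred at the tool tip."""
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    d2 = (xs - tip[0]) ** 2 + (ys - tip[1]) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def similarity_loss(rendered: torch.Tensor, observed: torch.Tensor) -> torch.Tensor:
    """Stand-in for the paper's image-feature similarity: one minus soft Dice."""
    inter = (rendered * observed).sum()
    return 1.0 - 2.0 * inter / (rendered.sum() + observed.sum() + 1e-6)

# The observation comes from the true (unknown) kinematics; the estimate starts
# from deliberately wrong joint values, mimicking inaccurate reported kinematics.
true_q = torch.tensor([0.60, -0.40])
observed = render_soft_mask(forward_kinematics(true_q)).detach()

q_est = torch.tensor([0.45, -0.20], requires_grad=True)
optimizer = torch.optim.Adam([q_est], lr=0.02)

for _ in range(400):
    optimizer.zero_grad()
    loss = similarity_loss(render_soft_mask(forward_kinematics(q_est)), observed)
    loss.backward()
    optimizer.step()

print("corrected joints:", q_est.detach().tolist(), "true joints:", true_q.tolist())
```

In this toy setting the optimization recovers joint angles close to the true ones, which mirrors the paper's point that the corrected kinematics are themselves a useful output (e.g., for force estimation).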
Related papers
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets [24.77850617214567]
We propose a foundation representation learning framework that captures both visual features and dynamics information, such as the actions and proprioceptive states of manipulation tasks.
Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions.
We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss that predicts actions during pre-training and a time contrastive loss (a generic sketch of this kind of alignment objective appears after the related-papers list).
arXiv Detail & Related papers (2024-10-29T17:58:13Z)
- Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin-based Scene Representation [14.108636146958007]
End-to-end trained neural networks that predict surgical phase directly from videos have shown excellent performance on benchmarks.
Our goal is to improve model robustness to variations in the surgical videos by leveraging the digital twin (DT) paradigm.
This approach takes advantage of the recent vision foundation models that ensure reliable low-level scene understanding.
arXiv Detail & Related papers (2024-10-26T00:49:06Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and achieving high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns general knowledge of instruments and the ability to adapt quickly through a video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
- Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and the inherent correlations in multi-modal data for gesture recognition.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1-score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
- A Kinematic Bottleneck Approach For Pose Regression of Flexible Surgical Instruments directly from Images [17.32860829016479]
We propose a self-supervised image-based method, exploiting, at training time only, the kinematic information provided by the robot.
In order to avoid introducing time-consuming manual annotations, the problem is formulated as an auto-encoder.
Validation of the method was performed on semi-synthetic, phantom and in-vivo datasets, obtained using a flexible robotized endoscope.
arXiv Detail & Related papers (2021-02-28T18:41:18Z)
- Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy (a generic torchvision fine-tuning sketch follows this list).
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
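The last entry above fine-tunes Mask R-CNN to segment a humanoid robot's hand. The snippet below is the standard torchvision head-replacement recipe for that kind of adaptation, not the paper's code; the single "hand" foreground class and the dummy training sample are assumptions made for illustration.

```python
# Generic torchvision recipe (not the paper's code) for fine-tuning Mask R-CNN
# to segment a robot hand as a single foreground class.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_hand_segmentation_model(num_classes: int = 2):   # background + hand
    """Load a COCO-pretrained Mask R-CNN and replace its heads for our classes."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    # Swap the box classification head for our class count.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Swap the mask prediction head likewise.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model

model = build_hand_segmentation_model()
model.train()

# One illustrative training step on a dummy sample; in practice this would be
# a dataset of egocentric images with instance masks of the robot hand.
image = torch.rand(3, 480, 640)
mask = torch.zeros(1, 480, 640, dtype=torch.uint8)
mask[0, 120:360, 100:300] = 1
target = {"boxes": torch.tensor([[100.0, 120.0, 300.0, 360.0]]),
          "labels": torch.tensor([1]),
          "masks": mask}

losses = model([image], [target])        # dict of box, class, and mask losses
sum(losses.values()).backward()
```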
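The manipulation-centric pre-training entry at the top of the list describes a contrastive loss that aligns visual observations with proprioceptive state-action dynamics, combined with a BC-like actor loss. The sketch below is a generic PyTorch illustration of such an objective, not the paper's implementation; the embedding sizes, temperature, loss weight, and the omission of the time contrastive term are all simplifications.

```python
# Generic sketch (not the paper's code) of an InfoNCE-style loss that aligns
# image embeddings with embeddings of proprioceptive state-action dynamics,
# plus a simple BC-like action-prediction head. All dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionDynamicsAligner(nn.Module):
    def __init__(self, img_feat_dim=512, state_dim=14, act_dim=7, embed_dim=128):
        super().__init__()
        self.visual_proj = nn.Linear(img_feat_dim, embed_dim)
        self.dynamics_proj = nn.Sequential(
            nn.Linear(state_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim))
        self.actor_head = nn.Linear(embed_dim, act_dim)   # BC-like action head

    def forward(self, img_feats, states, actions, temperature=0.1, bc_weight=1.0):
        z_v = F.normalize(self.visual_proj(img_feats), dim=-1)          # (B, D)
        z_d = F.normalize(self.dynamics_proj(
            torch.cat([states, actions], dim=-1)), dim=-1)              # (B, D)

        # InfoNCE: matching (image, state-action) pairs are positives,
        # all other pairs in the batch serve as negatives.
        logits = z_v @ z_d.t() / temperature                            # (B, B)
        labels = torch.arange(logits.size(0), device=logits.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.t(), labels))

        # BC-like actor loss: predict the action from the visual embedding.
        bc = F.mse_loss(self.actor_head(z_v), actions)
        return contrastive + bc_weight * bc

# Dummy batch to show the call signature.
model = VisionDynamicsAligner()
loss = model(torch.randn(8, 512), torch.randn(8, 14), torch.randn(8, 7))
loss.backward()
```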