Recognition and Prediction of Surgical Gestures and Trajectories Using
Transformer Models in Robot-Assisted Surgery
- URL: http://arxiv.org/abs/2212.01683v1
- Date: Sat, 3 Dec 2022 20:26:48 GMT
- Title: Recognition and Prediction of Surgical Gestures and Trajectories Using
Transformer Models in Robot-Assisted Surgery
- Authors: Chang Shi, Yi Zheng, Ann Majewicz Fey
- Abstract summary: Transformer models were first developed for Natural Language Processing (NLP) to model word sequences.
We propose the novel use of a Transformer model for three tasks: gesture recognition, gesture prediction, and trajectory prediction during RAS.
- Score: 10.719885390990433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical activity recognition and prediction can help provide important
context in many Robot-Assisted Surgery (RAS) applications, for example,
surgical progress monitoring and estimation, surgical skill evaluation, and
shared control strategies during teleoperation. Transformer models were first
developed for Natural Language Processing (NLP) to model word sequences and
soon the method gained popularity for general sequence modeling tasks. In this
paper, we propose the novel use of a Transformer model for three tasks: gesture
recognition, gesture prediction, and trajectory prediction during RAS. We
modify the original Transformer architecture to be able to generate the current
gesture sequence, future gesture sequence, and future trajectory sequence
estimations using only the current kinematic data of the surgical robot
end-effectors. We evaluate our proposed models on the JHU-ISI Gesture and Skill
Assessment Working Set (JIGSAWS) and use Leave-One-User-Out (LOUO)
cross-validation to ensure the generalizability of our results. Our models
achieve up to 89.3% gesture recognition accuracy, 84.6% gesture prediction
accuracy (1 second ahead), and 2.71 mm trajectory prediction error (1 second
ahead). Our models match, and in some cases outperform, state-of-the-art
methods while using only the kinematic data channel. This approach can enable
near-real-time surgical activity recognition and prediction.
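To make the described approach concrete, below is a minimal sketch, not the authors' released code, of a Transformer encoder that maps a window of end-effector kinematics to per-frame gesture logits and trajectory estimates, in the spirit of the model described above. The kinematic dimensionality, window length, gesture vocabulary size, and trajectory output size are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch of a kinematics-only Transformer for gesture recognition and
# trajectory estimation. All dimensions below are illustrative assumptions,
# not the paper's exact configuration.
import torch
import torch.nn as nn

class KinematicTransformer(nn.Module):
    def __init__(self, kin_dim=38, d_model=128, n_heads=4, n_layers=4,
                 n_gestures=15, traj_dim=6, max_len=512):
        super().__init__()
        self.input_proj = nn.Linear(kin_dim, d_model)                 # embed each kinematic frame
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.gesture_head = nn.Linear(d_model, n_gestures)            # per-frame gesture logits
        self.traj_head = nn.Linear(d_model, traj_dim)                 # per-frame trajectory estimate

    def forward(self, kin):                       # kin: (batch, frames, kin_dim)
        x = self.input_proj(kin) + self.pos_embed[:, :kin.size(1)]
        h = self.encoder(x)                       # (batch, frames, d_model)
        return self.gesture_head(h), self.traj_head(h)

# Toy forward pass on random data shaped like a 2-second kinematic window at 30 Hz.
model = KinematicTransformer()
gesture_logits, traj = model(torch.randn(2, 60, 38))
print(gesture_logits.shape, traj.shape)           # torch.Size([2, 60, 15]) torch.Size([2, 60, 6])
```

The LOUO evaluation mentioned above can be approximated by grouping trials by surgeon and holding out one surgeon per fold; one way to set this up (again an assumption about the protocol, not the paper's code) is shown below.

```python
from sklearn.model_selection import LeaveOneGroupOut
# X: per-trial kinematic windows, y: labels, user_ids: surgeon ID for each trial
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=user_ids):
    ...  # train on train_idx, evaluate on the held-out surgeon in test_idx
```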
Related papers
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- CaRTS: Causality-driven Robot Tool Segmentation from Vision and Kinematics Data [11.92904350972493]
Vision-based segmentation of the robotic tool during robot-assisted surgery enables downstream applications, such as augmented reality feedback.
With the introduction of deep learning, many methods were presented to solve instrument segmentation directly and solely from images.
We present CaRTS, a causality-driven robot tool segmentation algorithm, that is designed based on a complementary causal model of the robot tool segmentation task.
arXiv Detail & Related papers (2022-03-15T22:26:19Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, and attains high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Future Frame Prediction for Robot-assisted Surgery [57.18185972461453]
We propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences.
In addition to the content distribution, our model learns the motion distribution, which is novel in handling the small movements of surgical tools.
arXiv Detail & Related papers (2021-03-18T15:12:06Z)
- Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations across the modalities towards recognizing gestures.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1-score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
- daVinciNet: Joint Prediction of Motion and Surgical State in Robot-Assisted Surgery [13.928484202934651]
We propose daVinciNet - an end-to-end dual-task model for robot motion and surgical state predictions.
Our model achieves up to 93.85% short-term (0.5 s) and 82.11% long-term (2 s) state prediction accuracy, as well as 1.07 mm short-term and 5.62 mm long-term trajectory prediction error.
arXiv Detail & Related papers (2020-09-24T20:28:06Z)
- Predictive Modeling of Periodic Behavior for Human-Robot Symbiotic Walking [13.68799310875662]
We extend Interaction Primitives to periodic movement regimes, i.e., walking.
We show that this model is particularly well-suited for learning data-driven, customized models of human walking.
We also demonstrate how the same framework can be used to learn controllers for a robotic prosthesis.
arXiv Detail & Related papers (2020-05-27T03:30:48Z)
- Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)