Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery
- URL: http://arxiv.org/abs/2011.01619v2
- Date: Tue, 29 Jun 2021 05:52:38 GMT
- Title: Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery
- Authors: Yonghao Long, Jie Ying Wu, Bo Lu, Yueming Jin, Mathias Unberath,
Yun-Hui Liu, Pheng Ann Heng and Qi Dou
- Abstract summary: We propose a novel online approach, a multi-modal relational graph network (MRG-Net), to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
- Score: 84.73764603474413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic surgical gesture recognition is fundamentally important to enable
intelligent cognitive assistance in robotic surgery. With recent advancement in
robot-assisted minimally invasive surgery, rich information including surgical
videos and robotic kinematics can be recorded, which provide complementary
knowledge for understanding surgical gestures. However, existing methods either
solely adopt uni-modal data or directly concatenate multi-modal
representations, which cannot sufficiently exploit the informative
correlations inherent in visual and kinematics data to boost gesture
recognition accuracy. In this regard, we propose a novel online approach, a
multi-modal relational graph network (MRG-Net), to dynamically integrate
visual and kinematics information through interactive message propagation in
the latent feature space. Specifically, we first extract embeddings from video
and kinematics sequences with temporal convolutional networks and LSTM units.
Next, we identify multi-relations in these multi-modal embeddings and leverage
them through a hierarchical relational graph learning module. The effectiveness
of our method is demonstrated with state-of-the-art results on the public
JIGSAWS dataset, outperforming current uni-modal and multi-modal methods on
both suturing and knot tying tasks. Furthermore, we validated our method on
in-house visual-kinematics datasets collected with da Vinci Research Kit (dVRK)
platforms in two centers, achieving consistently promising performance.
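The abstract describes the pipeline only in words. As a rough illustration, the following minimal PyTorch sketch encodes video with temporal convolutions and kinematics with an LSTM, then fuses the per-timestep embeddings with one round of cross-modal message passing; the layer sizes, the fusion rule, and the use of pre-extracted per-frame visual features are all assumptions standing in for the paper's hierarchical relational graph module, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MRGNetSketch(nn.Module):
    """Illustrative sketch only: TCN video encoder + LSTM kinematics encoder,
    fused by one round of message passing between the two modality nodes.
    Sizes and the fusion rule are assumptions, not the paper's exact design."""

    def __init__(self, vis_dim=512, kin_dim=16, hid=128, n_gestures=10):
        super().__init__()
        # Dilated temporal convolutions over pre-extracted per-frame visual features
        self.tcn = nn.Sequential(
            nn.Conv1d(vis_dim, hid, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(),
        )
        # Recurrent encoder for kinematics sequences (e.g., tool poses/velocities)
        self.lstm = nn.LSTM(kin_dim, hid, batch_first=True)
        # Message functions between the visual and kinematics graph nodes
        self.msg_v2k = nn.Linear(hid, hid)
        self.msg_k2v = nn.Linear(hid, hid)
        self.classifier = nn.Linear(2 * hid, n_gestures)

    def forward(self, video_feats, kinematics):
        # video_feats: (B, T, vis_dim); kinematics: (B, T, kin_dim)
        v = self.tcn(video_feats.transpose(1, 2)).transpose(1, 2)  # (B, T, hid)
        k, _ = self.lstm(kinematics)                               # (B, T, hid)
        # One round of interactive message propagation in the latent space
        v_upd = torch.relu(v + self.msg_k2v(k))
        k_upd = torch.relu(k + self.msg_v2k(v))
        return self.classifier(torch.cat([v_upd, k_upd], dim=-1))  # per-frame logits
```

For strictly online recognition as claimed in the abstract, the temporal convolutions would additionally need causal (left-only) padding; symmetric padding is used above only for brevity.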
Related papers
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach organizes various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
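The summary implies next-token prediction over discrete visual tokens. As a rough illustration (the tokenizer, vocabulary size, and model dimensions are assumptions, not the paper's), a decoder-only transformer can be pre-trained autoregressively like this:

```python
import torch
import torch.nn as nn

class VisualTokenAR(nn.Module):
    """Sketch of autoregressive pre-training over a visual token sequence."""

    def __init__(self, vocab=8192, dim=256, heads=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)  # used decoder-only
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):  # tokens: (B, T) discrete visual token ids
        T = tokens.size(1)
        # Causal mask so position t only attends to positions <= t
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # next-token logits, (B, T, vocab)

# Objective: predict token t+1 from tokens up to t
model = VisualTokenAR()
toks = torch.randint(0, 8192, (2, 16))
logits = model(toks)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 8192), toks[:, 1:].reshape(-1))
```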
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
- Efficient Surgical Tool Recognition via HMM-Stabilized Deep Learning [25.146476653453227]
We propose an HMM-stabilized deep learning method for tool presence detection.
A range of experiments confirm that the proposed approaches achieve better performance with lower training and running costs.
These results suggest that popular deep learning approaches with over-complicated model structures may suffer from inefficient utilization of data.
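As an illustration of how an HMM can stabilize frame-wise network outputs, the sketch below applies Viterbi smoothing with a "sticky" transition matrix to per-frame classifier posteriors; the uniform prior, transition structure, and stay probability are assumptions, not the paper's fitted model.

```python
import numpy as np

def hmm_smooth(frame_probs, stay=0.95):
    """Viterbi-smooth per-frame posteriors (T, K) into a stable label path,
    discouraging rapid label flicker via a high self-transition probability."""
    T, K = frame_probs.shape
    trans = np.full((K, K), (1.0 - stay) / (K - 1))
    np.fill_diagonal(trans, stay)
    log_p = np.log(frame_probs + 1e-12)
    log_t = np.log(trans)
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = log_p[0] - np.log(K)           # uniform initial state prior
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_t  # cand[i, j]: score ending in j via i
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_p[t]
    path = np.zeros(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 2, -1, -1):            # backtrace the best path
        path[t] = back[t + 1][path[t + 1]]
    return path
```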
arXiv Detail & Related papers (2024-04-07T15:27:35Z)
- Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training [0.850206009406913]
MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By drawing on both sensory inputs, MViTac uses intra- and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction.
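A minimal sketch of the contrastive objective implied here, using a symmetric InfoNCE loss over paired embeddings (the temperature and pairing scheme are assumptions, not MViTac's exact formulation):

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE: matching rows of a and b are positives,
    all other rows in the batch serve as negatives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau          # (B, B) cosine similarities
    labels = torch.arange(a.size(0))  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Inter-modality loss: vision vs. touch embeddings of the same sample;
# intra-modality loss: two augmented views within one modality.
v1, v2 = torch.randn(8, 128), torch.randn(8, 128)  # two visual views
t1 = torch.randn(8, 128)                           # tactile embeddings
loss = info_nce(v1, t1) + info_nce(v1, v2)
```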
arXiv Detail & Related papers (2024-01-22T15:11:57Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
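For context, the bone stream in joint-bone methods is commonly derived from the joint stream as vectors along skeleton edges; a small sketch with a hypothetical edge list:

```python
import torch

# Hypothetical (parent, child) skeleton edges; real datasets define their own.
EDGES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def bones_from_joints(joints):
    """joints: (B, T, J, 3) xyz per frame -> bone vectors (B, T, len(EDGES), 3)."""
    parents = torch.tensor([p for p, _ in EDGES])
    children = torch.tensor([c for _, c in EDGES])
    return joints[:, :, children] - joints[:, :, parents]
```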
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from a simulator to a real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations in multi-modal data towards recognizing gestures.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
- Complex Human Action Recognition in Live Videos Using Hybrid FR-DL Method [1.027974860479791]
We address challenges in the preprocessing phase through automated selection of representative frames from the input sequences.
We propose a hybrid technique using background subtraction and HOG, followed by a deep neural network and a skeletal modelling method.
We name our model the Feature Reduction & Deep Learning based action recognition method, or FR-DL for short.
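A minimal OpenCV sketch of the described preprocessing, background subtraction followed by a HOG descriptor (all parameters are assumptions, not FR-DL's exact settings):

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2()  # models the static background
hog = cv2.HOGDescriptor()                          # default 64x128 window

def frame_features(frame):
    """Isolate moving pixels, then describe them with HOG."""
    mask = subtractor.apply(frame)                 # foreground mask
    fg = cv2.bitwise_and(frame, frame, mask=mask)  # keep only moving pixels
    patch = cv2.resize(cv2.cvtColor(fg, cv2.COLOR_BGR2GRAY), (64, 128))
    return hog.compute(patch)                      # HOG feature vector
```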
arXiv Detail & Related papers (2020-07-06T15:12:50Z)
- Multi-Task Recurrent Neural Network for Surgical Gesture Recognition and Progress Prediction [17.63619129438996]
We propose a multi-task recurrent neural network for simultaneous recognition of surgical gestures and estimation of a novel formulation of surgical task progress.
We demonstrate that recognition performance improves in multi-task frameworks with progress estimation, without any additional manual labelling or training.
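A minimal sketch of such a multi-task recurrent model, with a shared LSTM feeding a gesture-classification head and a progress-regression head (sizes and the progress formulation are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class MultiTaskGestureRNN(nn.Module):
    """Shared recurrent encoder with two heads: per-frame gesture logits
    and a task-progress estimate in [0, 1]."""

    def __init__(self, in_dim=16, hid=64, n_gestures=10):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid, batch_first=True)
        self.gesture_head = nn.Linear(hid, n_gestures)
        self.progress_head = nn.Linear(hid, 1)

    def forward(self, x):  # x: (B, T, in_dim) kinematics sequence
        h, _ = self.lstm(x)
        gestures = self.gesture_head(h)                  # (B, T, n_gestures)
        progress = torch.sigmoid(self.progress_head(h))  # (B, T, 1)
        return gestures, progress
```

Progress targets such as t/T can be computed per trial automatically, so the second head needs no extra manual labels, consistent with the claim above.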
arXiv Detail & Related papers (2020-03-10T14:28:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.