ST-MTL: Spatio-Temporal Multitask Learning Model to Predict Scanpath
While Tracking Instruments in Robotic Surgery
- URL: http://arxiv.org/abs/2112.08189v1
- Date: Fri, 10 Dec 2021 15:20:27 GMT
- Title: ST-MTL: Spatio-Temporal Multitask Learning Model to Predict Scanpath
While Tracking Instruments in Robotic Surgery
- Authors: Mobarakol Islam, Vibashan VS, Chwee Ming Lim, Hongliang Ren
- Abstract summary: Learning of the task-oriented attention while tracking instrument holds vast potential in image-guided robotic surgery.
We propose an end-to-end Multi-Task Learning (ST-MTL) model with a shared encoder and Sink-temporal decoders for the real-time surgical instrument segmentation and task-oriented saliency detection.
We tackle the problem with a novel asynchronous-temporal optimization technique by calculating independent gradients for each decoder.
Compared to the state-of-the-art segmentation and saliency methods, our model most outperforms the evaluation metrics and produces an outstanding performance in challenge
- Score: 14.47768738295518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Representation learning of task-oriented attention while tracking
instruments holds vast potential in image-guided robotic surgery. Incorporating
cognitive ability to automate the camera control enables the surgeon to
concentrate more on dealing with surgical instruments. The objective is to
reduce the operation time and facilitate the surgery for both surgeons and
patients. We propose an end-to-end trainable Spatio-Temporal Multi-Task
Learning (ST-MTL) model with a shared encoder and spatio-temporal decoders for
the real-time surgical instrument segmentation and task-oriented saliency
detection. In an MTL model with shared parameters, optimizing multiple loss
functions toward a common convergence point is still an open challenge. We tackle the
problem with a novel asynchronous spatio-temporal optimization (ASTO) technique
by calculating independent gradients for each decoder. We also design a
competitive squeeze and excitation unit by casting a skip connection that
retains weak features, excites strong features, and performs dynamic spatial
and channel-wise feature recalibration. To better capture long-term
spatio-temporal dependencies, we enhance the long short-term memory (LSTM)
module by concatenating high-level encoder features of consecutive frames. We
also introduce a Sinkhorn-regularized loss to enhance task-oriented saliency
detection while preserving computational efficiency. We generate the task-aware
saliency maps and scanpath of the instruments on the dataset of the MICCAI 2017
robotic instrument segmentation challenge. Compared to the state-of-the-art
segmentation and saliency methods, our model outperforms on most of the evaluation
metrics and produces outstanding performance in the challenge.
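The core optimization idea, back-propagating an independent gradient for each decoder instead of summing the task losses, can be illustrated with a minimal PyTorch sketch. The toy modules, per-task optimizers, and update order below are illustrative assumptions; the abstract does not specify the actual ASTO schedule or architecture.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the shared encoder and the two task decoders.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
seg_decoder = nn.Conv2d(16, 2, 1)   # binary instrument segmentation logits
sal_decoder = nn.Conv2d(16, 1, 1)   # single-channel saliency logits

# One optimizer per task, each covering the shared encoder plus its decoder.
opt_seg = torch.optim.Adam(list(encoder.parameters()) + list(seg_decoder.parameters()), lr=1e-4)
opt_sal = torch.optim.Adam(list(encoder.parameters()) + list(sal_decoder.parameters()), lr=1e-4)

frames = torch.randn(2, 3, 64, 64)          # dummy video frames
seg_gt = torch.randint(0, 2, (2, 64, 64))   # dummy segmentation labels
sal_gt = torch.rand(2, 1, 64, 64)           # dummy saliency maps

# ASTO-style step: each decoder's loss gets its own forward/backward pass
# and parameter update, rather than being summed into one joint objective.
opt_seg.zero_grad()
nn.functional.cross_entropy(seg_decoder(encoder(frames)), seg_gt).backward()
opt_seg.step()

opt_sal.zero_grad()
nn.functional.binary_cross_entropy_with_logits(sal_decoder(encoder(frames)), sal_gt).backward()
opt_sal.step()
```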
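One plausible reading of the competitive squeeze-and-excitation unit is sketched below: channel-wise and spatial excitations recalibrate the features, and the result competes with the skip connection through an element-wise maximum, so strongly excited features win while weak skip features are still retained. Module and parameter names here are hypothetical, not the authors' exact block.

```python
import torch
import torch.nn as nn

class CompetitiveSCSE(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel excitation: global pool -> bottleneck -> per-channel gate.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial excitation: 1x1 conv -> per-pixel gate.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x, skip):
        # Channel- and spatial-wise recalibrations compete element-wise...
        recalibrated = torch.max(x * self.cse(x), x * self.sse(x))
        # ...and the winner competes with the skip connection.
        return torch.max(recalibrated, skip)

block = CompetitiveSCSE(16)
out = block(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))
```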
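The LSTM enhancement, feeding it concatenated high-level encoder features of consecutive frames, might look roughly like this; the feature dimension and the pairing of frame t with frame t+1 are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim = 256
lstm = nn.LSTM(input_size=2 * feat_dim, hidden_size=feat_dim, batch_first=True)

feats = torch.randn(1, 8, feat_dim)                   # (batch, frames, features)
pairs = torch.cat([feats[:, :-1], feats[:, 1:]], -1)  # concat frame t with t+1
out, _ = lstm(pairs)                                  # temporally enriched features
```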
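The Sinkhorn-regularized loss is named but not specified in the abstract; the sketch below shows a generic entropy-regularized optimal-transport loss of the kind the name suggests, computed by Sinkhorn fixed-point iterations. The epsilon, iteration count, and squared pixel-distance cost are illustrative choices, and practical implementations usually stabilize the updates in the log domain.

```python
import torch

def sinkhorn_ot_loss(p, q, cost, eps=0.1, n_iters=50):
    # Entropy-regularized OT between discrete distributions p and q
    # (non-negative, each summing to 1) under the given cost matrix.
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u, v = torch.ones_like(p), torch.ones_like(q)
    for _ in range(n_iters):                    # Sinkhorn fixed-point updates
        u = p / (K @ v + 1e-8)
        v = q / (K.t() @ u + 1e-8)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan
    return (plan * cost).sum()

# Example: compare predicted and ground-truth saliency maps on an 8x8 grid,
# with squared pixel distance (normalized to [0, 1]) as the transport cost.
h = w = 8
ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
coords = torch.stack([ys.flatten(), xs.flatten()], 1).float()
cost = torch.cdist(coords, coords) ** 2
cost = cost / cost.max()

pred = torch.rand(h * w); pred = pred / pred.sum()
gt = torch.rand(h * w); gt = gt / gt.sum()
print(sinkhorn_ot_loss(pred, gt, cost))
```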
Related papers
- SEDMamba: Enhancing Selective State Space Modelling with Bottleneck Mechanism and Fine-to-Coarse Temporal Fusion for Efficient Error Detection in Robot-Assisted Surgery [7.863539113283565]
We propose a novel hierarchical model named SEDMamba, which incorporates the selective state space model (SSM) into surgical error detection.
SEDMamba enhances selective SSM with a bottleneck mechanism and fine-to-coarse temporal fusion (FCTF) to detect and temporally localize surgical errors in long videos.
Our work also contributes the first-of-its-kind, frame-level, in-vivo surgical error dataset to support error detection in real surgical cases.
arXiv Detail & Related papers (2024-06-22T19:20:35Z) - GLSFormer: Gated Long-Short Sequence Transformer for Step
Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z) - TUNeS: A Temporal U-Net with Self-Attention for Video-based Surgical Phase Recognition [1.5237530964650965]
We propose TUNeS, an efficient and simple temporal model that incorporates self-attention at the core of a convolutional U-Net structure.
In our experiments, almost all temporal models performed better on top of feature extractors that were trained with longer temporal context.
arXiv Detail & Related papers (2023-07-19T14:10:55Z) - Robotic Navigation Autonomy for Subretinal Injection via Intelligent
Real-Time Virtual iOCT Volume Slicing [88.99939660183881]
We propose a framework for autonomous robotic navigation for subretinal injection.
Our method consists of an instrument pose estimation method, an online registration between the robotic and the iOCT system, and trajectory planning tailored for navigation to an injection target.
Our experiments on ex-vivo porcine eyes demonstrate the precision and repeatability of the method.
arXiv Detail & Related papers (2023-01-17T21:41:21Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Efficient Global-Local Memory for Real-time Instrument Segmentation of
Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z) - Temporal Memory Relation Network for Workflow Recognition from Surgical
Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z) - Real-Time Instrument Segmentation in Robotic Surgery using Auxiliary
Supervised Deep Adversarial Learning [15.490603884631764]
Real-time semantic segmentation of the robotic instruments and tissues is a crucial step in robot-assisted surgery.
We have developed a light-weight cascaded convolutional neural network (CNN) to segment the surgical instruments from high-resolution videos.
We show that our model surpasses existing algorithms for pixel-wise segmentation of surgical instruments in both prediction accuracy and segmentation speed on high-resolution videos.
arXiv Detail & Related papers (2020-07-22T10:16:07Z) - Symmetric Dilated Convolution for Surgical Gesture Recognition [10.699258974625073]
We propose a novel temporal convolutional architecture to automatically detect and segment surgical gestures.
We devise our method with a symmetric dilation structure bridged by a self-attention module to encode and decode the long-term temporal patterns.
We validate our approach on a fundamental robotic suturing task from the JIGSAWS dataset.
arXiv Detail & Related papers (2020-07-13T13:34:48Z) - Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z) - AP-MTL: Attention Pruned Multi-task Learning Model for Real-time
Instrument Detection and Segmentation in Robot-assisted Surgery [23.33984309289549]
Training a real-time robotic system for the detection and segmentation of high-resolution images poses a challenging problem given limited computational resources.
We develop a novel end-to-end trainable real-time Multi-Task Learning model with a weight-shared encoder and task-aware detection and segmentation decoders.
Our model significantly outperforms state-of-the-art segmentation and detection models, including the best-performing models in the challenge.
arXiv Detail & Related papers (2020-03-10T14:24:51Z)