ARST: Auto-Regressive Surgical Transformer for Phase Recognition from
Laparoscopic Videos
- URL: http://arxiv.org/abs/2209.01148v1
- Date: Fri, 2 Sep 2022 16:05:39 GMT
- Title: ARST: Auto-Regressive Surgical Transformer for Phase Recognition from
Laparoscopic Videos
- Authors: Xiaoyang Zou, Wenyong Liu, Junchen Wang, Rong Tao and Guoyan Zheng
- Abstract summary: Transformer, originally proposed for sequential data modeling in natural language processing, has been successfully applied to surgical phase recognition.
In this work, an Auto-Regressive Surgical Transformer, referred to as ARST, is proposed for the first time for online surgical phase recognition from laparoscopic videos.
- Score: 2.973286445527318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phase recognition plays an essential role for surgical workflow analysis in
computer assisted intervention. Transformer, originally proposed for sequential
data modeling in natural language processing, has been successfully applied to
surgical phase recognition. Existing works based on transformer mainly focus on
modeling attention dependency, without introducing auto-regression. In this
work, an Auto-Regressive Surgical Transformer, referred to as ARST, is
proposed for the first time for online surgical phase recognition from
laparoscopic videos,
modeling the inter-phase correlation implicitly by conditional probability
distribution. To reduce inference bias and to enhance phase consistency, we
further develop a consistency constraint inference strategy based on
auto-regression. We conduct comprehensive validations on a well-known public
dataset Cholec80. Experimental results show that our method outperforms the
state-of-the-art methods both quantitatively and qualitatively, and achieves an
inference rate of 66 frames per second (fps).
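The abstract describes two ideas: conditioning each phase prediction on the previous one (auto-regression over a conditional probability distribution) and a consistency-constraint inference strategy that suppresses inconsistent phase flickers. A minimal sketch of that decoding idea is below; the transition prior, its values, and the helper names are illustrative assumptions, not the paper's actual model (the phase list follows the standard Cholec80 annotation).

```python
# Hedged sketch of auto-regressive, consistency-constrained online phase
# decoding. The transition prior below is an assumption for illustration;
# ARST learns the conditional distribution p(phase_t | phase_{t-1}, video).

PHASES = ["Preparation", "CalotTriangleDissection", "ClippingCutting",
          "GallbladderDissection", "GallbladderPackaging",
          "CleaningCoagulation", "GallbladderRetraction"]

def transition_prior(prev_idx, next_idx):
    """Illustrative prior: a phase mostly persists or advances by one."""
    if next_idx == prev_idx:
        return 0.90                               # stay in the current phase
    if next_idx == prev_idx + 1:
        return 0.09                               # advance to the next phase
    return 0.01 / max(1, len(PHASES) - 2)         # rare jumps elsewhere

def predict_phase(frame_probs, prev_idx):
    """Combine per-frame classifier scores p(phase | frame) with the
    auto-regressive prior p(phase_t | phase_{t-1}) and take the argmax."""
    scores = [p * transition_prior(prev_idx, i)
              for i, p in enumerate(frame_probs)]
    return max(range(len(scores)), key=scores.__getitem__)

def smooth_sequence(per_frame_probs, start_idx=0):
    """Online decoding: each prediction conditions on the previous one,
    which suppresses isolated, order-violating flickers."""
    preds, prev = [], start_idx
    for probs in per_frame_probs:
        prev = predict_phase(probs, prev)
        preds.append(prev)
    return preds
```

For example, a single noisy frame whose raw argmax jumps from phase 0 to phase 3 is held at phase 0 by the prior, while a frame with strong evidence for phase 1 still triggers the transition; this is the sense in which conditioning on the previous prediction enhances phase consistency.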
Related papers
- SurgPLAN: Surgical Phase Localization Network for Phase Recognition [14.857715124466594]
We propose a Surgical Phase LocAlization Network, named SurgPLAN, to facilitate more accurate and stable surgical phase recognition.
We first devise a Pyramid SlowFast (PSF) architecture to serve as the visual backbone to capture multi-scale spatial and temporal features by two branches with different frame sampling rates.
arXiv Detail & Related papers (2023-11-16T15:39:01Z)
- GLSFormer: Gated-Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- LoViT: Long Video Transformer for Surgical Phase Recognition [59.06812739441785]
We present a two-stage method, called Long Video Transformer (LoViT) for fusing short- and long-term temporal information.
Our approach outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets consistently.
arXiv Detail & Related papers (2023-05-15T20:06:14Z)
- Surgical Phase Recognition in Laparoscopic Cholecystectomy [57.929132269036245]
We propose a Transformer-based method that utilizes calibrated confidence scores for a 2-stage inference pipeline.
Our method outperforms the baseline model on the Cholec80 dataset, and can be applied to a variety of action segmentation methods.
arXiv Detail & Related papers (2022-06-14T22:55:31Z) - Trans-SVNet: Accurate Phase Recognition from Surgical Videos via Hybrid
Embedding Aggregation Transformer [57.18185972461453]
We introduce, for the first time in surgical workflow analysis, a Transformer to reconsider the previously ignored complementary effects of spatial and temporal features for accurate phase recognition.
Our framework is lightweight and processes the hybrid embeddings in parallel to achieve a high inference speed.
arXiv Detail & Related papers (2021-03-17T15:12:55Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised
Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by exploiting temporal cues in videos and the inherent correlations across modalities for gesture recognition.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations from the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
- TeCNO: Surgical Phase Recognition with Multi-Stage Temporal Convolutional Networks [43.95869213955351]
We propose a Multi-Stage Temporal Convolutional Network (MS-TCN) that performs hierarchical prediction refinement for surgical phase recognition.
Our method is thoroughly evaluated on two datasets of laparoscopic cholecystectomy videos with and without the use of additional surgical tool information.
arXiv Detail & Related papers (2020-03-24T10:12:30Z)
- Automatic Data Augmentation via Deep Reinforcement Learning for Effective Kidney Tumor Segmentation [57.78765460295249]
We develop a novel automatic learning-based data augmentation method for medical image segmentation.
In our method, we innovatively combine the data augmentation module and the subsequent segmentation module in an end-to-end training manner with a consistent loss.
We extensively evaluated our method on CT kidney tumor segmentation, which validated its promising results.
arXiv Detail & Related papers (2020-02-22T14:10:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.