LoViT: Long Video Transformer for Surgical Phase Recognition
- URL: http://arxiv.org/abs/2305.08989v3
- Date: Wed, 14 Jun 2023 16:40:08 GMT
- Title: LoViT: Long Video Transformer for Surgical Phase Recognition
- Authors: Yang Liu, Maxence Boels, Luis C. Garcia-Peraza-Herrera, Tom
Vercauteren, Prokar Dasgupta, Alejandro Granados and Sebastien Ourselin
- Abstract summary: We present a two-stage method, called Long Video Transformer (LoViT), for fusing short- and long-term temporal information.
Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online surgical phase recognition plays a significant role in
building contextual tools that could quantify performance and oversee the
execution of surgical workflows. Current approaches are limited: they train
spatial feature extractors with frame-level supervision, which can lead to
incorrect predictions when similar frames appear in different phases, and
they fuse local and global features poorly under computational constraints,
which hampers the analysis of the long videos commonly encountered in
surgical interventions. In this paper, we present a two-stage method, called
Long Video Transformer (LoViT), for fusing short- and long-term temporal
information. It combines a temporally-rich spatial feature extractor with a
multi-scale temporal aggregator consisting of two cascaded L-Trans modules
based on self-attention, followed by a G-Informer module based on ProbSparse
self-attention for processing global temporal information. A multi-scale
temporal head then combines local and global features and classifies
surgical phases using phase transition-aware supervision. Our approach
consistently outperforms state-of-the-art methods on the Cholec80 and
AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.4 pp
(percentage point) improvement in video-level accuracy on Cholec80 and a
3.1 pp improvement on AutoLaparo. Moreover, it achieves a 5.3 pp improvement
in phase-level Jaccard on AutoLaparo and a 1.55 pp improvement on Cholec80.
These results demonstrate the effectiveness of our approach in achieving
state-of-the-art surgical phase recognition on two datasets with different
surgical procedures and temporal sequencing characteristics, whilst
introducing mechanisms that cope with long videos.
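To make the second-stage design above concrete, here is a minimal,
illustrative PyTorch sketch of a LoViT-style two-branch temporal aggregator.
It assumes per-frame features have already been extracted by the spatial
backbone; all module names and sizes, and the use of ordinary full
self-attention in place of the paper's ProbSparse G-Informer, are
assumptions for illustration, not the authors' implementation.

    # Illustrative sketch only: a LoViT-style two-branch temporal aggregator.
    # Frame features are assumed pre-extracted; names/sizes are hypothetical,
    # and plain self-attention stands in for the paper's ProbSparse G-Informer.
    import torch
    import torch.nn as nn

    class LocalBranch(nn.Module):
        """Self-attention over a short window of recent frames (L-Trans stand-in)."""
        def __init__(self, dim=256, heads=4, window=16):
            super().__init__()
            self.window = window
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # two cascaded layers

        def forward(self, x):                         # x: (batch, time, dim)
            return self.encoder(x[:, -self.window:])  # attend within the local window

    class GlobalBranch(nn.Module):
        """Self-attention over the full sequence (G-Informer stand-in)."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=1)

        def forward(self, x):
            return self.encoder(x)                    # long-range temporal context

    class PhaseHead(nn.Module):
        """Fuses local and global features and classifies the current phase."""
        def __init__(self, dim=256, num_phases=7):    # Cholec80 defines 7 phases
            super().__init__()
            self.fc = nn.Linear(2 * dim, num_phases)

        def forward(self, local_f, global_f):
            fused = torch.cat([local_f.mean(1), global_f.mean(1)], dim=-1)
            return self.fc(fused)

    frames = torch.randn(1, 1000, 256)            # 1000 pre-extracted frame features
    local_f = LocalBranch()(frames)               # short-term context
    global_f = GlobalBranch()(frames)             # long-term context
    logits = PhaseHead()(local_f, global_f)       # phase logits for the latest frame
    print(logits.shape)                           # torch.Size([1, 7])

In the actual model, the global branch uses ProbSparse self-attention (from
Informer), which evaluates attention only for the most informative queries,
keeping long-sequence attention tractable for full-length surgical videos.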
Related papers
- MuST: Multi-Scale Transformers for Surgical Phase Recognition
Phase recognition in surgical videos is crucial for enhancing computer-aided
surgical systems. Existing methods often rely on fixed temporal windows for
video analysis to identify dynamic surgical phases. We propose Multi-Scale
Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based
approach.
arXiv Detail & Related papers (2024-07-24T15:38:20Z)
- Friends Across Time: Multi-Scale Action Segmentation Transformer for
Surgical Phase Recognition
We propose the Multi-Scale Action Segmentation Transformer (MS-AST) for
offline surgical phase recognition and the Multi-Scale Action Segmentation
Causal Transformer (MS-ASCT) for online surgical phase recognition. Our
method achieves 95.26% and 96.15% accuracy on the Cholec80 dataset for online
and offline surgical phase recognition, respectively.
arXiv Detail & Related papers (2024-01-22T01:34:03Z)
- SurgPLAN: Surgical Phase Localization Network for Phase Recognition
We propose a Surgical Phase LocAlization Network, named SurgPLAN, to
facilitate more accurate and stable surgical phase recognition. We first
devise a Pyramid SlowFast (PSF) architecture to serve as the visual backbone,
capturing multi-scale spatial and temporal features through two branches with
different frame sampling rates.
arXiv Detail & Related papers (2023-11-16T15:39:01Z)
- GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition
in Surgical Videos
We propose a vision transformer-based approach to learn temporal features
directly from sequence-level patches. We extensively evaluate our approach on
two cataract surgery video datasets, Cataract-101 and D99, and demonstrate
superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of
Robotic Surgical Video
We identify two important clues for surgical instrument perception: local
temporal dependency from adjacent frames and global semantic correlation over
long-range duration. We propose a novel dual-memory network (DMNet) to relate
both global and local temporal knowledge. Our method largely outperforms
state-of-the-art works on segmentation accuracy while maintaining real-time
speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical
Video
We propose a novel end-to-end temporal memory relation network (TMRNet) for
relating long-range and multi-scale temporal patterns. We have extensively
validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Trans-SVNet: Accurate Phase Recognition from Surgical Videos via Hybrid
Embedding Aggregation Transformer
We introduce, for the first time in surgical workflow analysis, a Transformer
to reconsider the ignored complementary effects of spatial and temporal
features for accurate phase recognition. Our framework is lightweight and
processes the hybrid embeddings in parallel to achieve a high inference
speed.
arXiv Detail & Related papers (2021-03-17T15:12:55Z)
- Learning Motion Flows for Semi-supervised Instrument Segmentation from
Robotic Surgical Video
We study semi-supervised instrument segmentation from robotic surgical videos
with sparse annotations. By exploiting generated data pairs, our framework
can recover and even enhance the temporal consistency of training sequences.
Results show that our method outperforms state-of-the-art semi-supervised
methods by a large margin.
arXiv Detail & Related papers (2020-07-06T02:39:32Z)