DACAT: Dual-stream Adaptive Clip-aware Time Modeling for Robust Online Surgical Phase Recognition
- URL: http://arxiv.org/abs/2409.06217v1
- Date: Tue, 10 Sep 2024 04:58:48 GMT
- Title: DACAT: Dual-stream Adaptive Clip-aware Time Modeling for Robust Online Surgical Phase Recognition
- Authors: Kaixiang Yang, Qiang Li, Zhiwei Wang
- Abstract summary: Surgical phase recognition is a crucial requirement in laparoscopic surgery, enabling various clinical applications like surgical risk forecasting.
We propose DACAT, a novel dual-stream model that adaptively learns clip-aware context information to enhance the temporal relationship.
- Score: 9.560659134295866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical phase recognition has become a crucial requirement in laparoscopic surgery, enabling various clinical applications like surgical risk forecasting. Current methods typically identify the surgical phase using individual frame-wise embeddings as the fundamental unit for time modeling. However, this approach is overly sensitive to current observations, often resulting in discontinuous and erroneous predictions within a complete surgical phase. In this paper, we propose DACAT, a novel dual-stream model that adaptively learns clip-aware context information to enhance the temporal relationship. In one stream, DACAT pretrains a frame encoder, caching all historical frame-wise features. In the other stream, DACAT fine-tunes a new frame encoder to extract the frame-wise feature at the current moment. Additionally, a max clip-response read-out (Max-R) module is introduced to bridge the two streams by using the current frame-wise feature to adaptively fetch the most relevant past clip from the feature cache. The clip-aware context feature is then encoded via cross-attention between the current frame and its fetched adaptive clip, and further utilized to enhance the time modeling for accurate online surgical phase recognition. The benchmark results on three public datasets, i.e., Cholec80, M2CAI16, and AutoLaparo, demonstrate the superiority of our proposed DACAT over existing state-of-the-art methods, with improvements in Jaccard scores of at least 4.5%, 4.6%, and 2.7%, respectively. Our code and models have been released at https://github.com/kk42yy/DACAT.
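The read-out and cross-attention steps described in the abstract can be illustrated with a minimal sketch. This is not the released DACAT code. It assumes that the "clip response" is the summed cosine similarity (minus a hypothetical `threshold`) between the current feature and each cached feature, that the fetched clip always ends at the most recent cached frame, and that cross-attention is single-head with no learned projections. The names `fetch_adaptive_clip` and `cross_attend` are illustrative only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)


def fetch_adaptive_clip(cache, query, threshold=0.5):
    """Pick the clip cache[start:end] with the largest summed response.

    Assumption (not from the paper): the response of a cached frame is
    cosine(frame, query) - threshold, so dissimilar frames count against
    the clip, and the clip must end at the latest cached frame.
    """
    if not cache:
        raise ValueError("feature cache is empty")
    best_start, best_sum, running = len(cache) - 1, float("-inf"), 0.0
    # Walk backwards in time, accumulating the response of a growing clip.
    for i in range(len(cache) - 1, -1, -1):
        running += cosine(cache[i], query) - threshold
        if running > best_sum:
            best_sum, best_start = running, i
    return best_start, len(cache)


def cross_attend(query, clip):
    """Single-head scaled dot-product attention of one query over a clip.

    Simplification: keys and values are the raw clip features, with no
    learned projection matrices.
    """
    d = len(query)
    scores = [dot(query, key) / math.sqrt(d) for key in clip]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the clip features gives the clip-aware context vector.
    return [sum(w * value[j] for w, value in zip(weights, clip)) for j in range(d)]


# Toy cache: two frames from an earlier phase, then three that resemble the query.
cache = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [1.0, 0.0]]
query = [1.0, 0.0]
start, end = fetch_adaptive_clip(cache, query)
context = cross_attend(query, cache[start:end])
print("clip bounds:", (start, end))
print("context feature:", context)
```

In an online setting the cache would hold only features of frames already observed, so the fetched clip never looks into the future. The fixed threshold here stands in for whatever learned response the actual Max-R module uses.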
Related papers
- GLSFormer: Gated-Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- A spatio-temporal network for video semantic segmentation in surgical videos [11.548181453080087]
We propose a novel architecture for modelling temporal relationships in videos.
The proposed model includes a decoder to enable semantic video segmentation.
The proposed decoder can be used on top of any segmentation encoder to improve temporal consistency.
arXiv Detail & Related papers (2023-06-19T16:36:48Z)
- SF-TMN: SlowFast Temporal Modeling Network for Surgical Phase Recognition [0.5669790037378094]
We propose SlowFast Temporal Modeling Network (SF-TMN) for surgical phase recognition.
It can achieve frame-level full video temporal modeling and segment-level full video temporal modeling.
SF-TMN with an ASFormer backbone outperforms the state-of-the-art Not End-to-End (TCN) method by 2.6% in accuracy and 7.4% in Jaccard score.
arXiv Detail & Related papers (2023-06-15T05:04:29Z)
- LoViT: Long Video Transformer for Surgical Phase Recognition [59.06812739441785]
We present a two-stage method, called Long Video Transformer (LoViT) for fusing short- and long-term temporal information.
Our approach outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets consistently.
arXiv Detail & Related papers (2023-05-15T20:06:14Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with far fewer FLOPs than Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Retrieval of surgical phase transitions using reinforcement learning [11.130363429095048]
We introduce a novel reinforcement learning formulation for offline phase transition retrieval.
By construction, our model does not produce spurious and noisy phase transitions, but contiguous phase blocks.
We compare our method against the recent top-performing frame-based approaches TeCNO and Trans-SVNet.
arXiv Detail & Related papers (2022-08-01T14:43:15Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.