Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS
Instance Segmentation
- URL: http://arxiv.org/abs/2302.11325v2
- Date: Tue, 4 Jul 2023 15:51:23 GMT
- Title: Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS
Instance Segmentation
- Authors: Chengxi Zeng, Xinyu Yang, David Smithard, Majid Mirmehdi, Alberto M
Gambaruto, Tilo Burghardt
- Abstract summary: This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It incorporates them with a temporal feature blender, which then tokenises the high-level spatio-temporal feature to form a strong global feature encoded via a Swin Transformer.
- Score: 10.789826145990016
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents a deep learning framework for medical video segmentation.
Convolutional neural network (CNN) and transformer-based methods have achieved
great milestones in medical image segmentation tasks due to their incredible
semantic feature encoding and global information comprehension abilities.
However, most existing approaches ignore a salient aspect of medical video data:
the temporal dimension. Our proposed framework explicitly extracts features
from neighbouring frames across the temporal dimension and incorporates them
with a temporal feature blender, which then tokenises the high-level
spatio-temporal feature to form a strong global feature encoded via a Swin
Transformer. The final segmentation results are produced via a UNet-like
encoder-decoder architecture. Our model outperforms other approaches by a
significant margin and improves the segmentation benchmarks on the VFSS2022
dataset, achieving Dice coefficients of 0.8986 and 0.8186 for the two datasets
tested. Our studies also show the efficacy of the temporal feature blending
scheme and cross-dataset transferability of learned capabilities. Code and
models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet.
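For readers unfamiliar with the metric quoted above, the Dice coefficient between a predicted mask A and a ground-truth mask B is 2|A∩B| / (|A| + |B|). The snippet below is a minimal sketch of that standard formula on flattened binary masks. It is not taken from the authors' repository: the function name `dice_coefficient` and the choice to score two empty masks as 1.0 are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    # Total foreground pixels across the two masks.
    total = sum(pred) + sum(target)

    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    print(dice_coefficient(predicted, ground_truth))
```

A score of 1.0 means the two masks overlap exactly and 0.0 means they share no foreground pixels, so the reported values of 0.8986 and 0.8186 indicate high but imperfect overlap.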
Related papers
- Are Vision xLSTM Embedded UNet More Reliable in Medical 3D Image Segmentation? [3.1777394653936937]
This paper investigates the integration of CNNs and Vision Extended Long Short-Term Memory (Vision-xLSTM) models by introducing a novel approach called UVixLSTM.
The Vision-xLSTM blocks capture temporal and global relationships within the patches extracted from the CNN feature maps.
UVixLSTM exhibits superior performance compared to state-of-the-art networks on the publicly-available dataset.
arXiv Detail & Related papers (2024-06-24T08:01:05Z)
- ParaTransCNN: Parallelized TransCNN Encoder for Medical Image
Segmentation [7.955518153976858]
We propose an advanced 2D feature extraction method by combining the convolutional neural network and Transformer architectures.
Our method shows better segmentation accuracy, especially on small organs.
arXiv Detail & Related papers (2024-01-27T05:58:36Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for
Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework can attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS
Instance Segmentation [11.575821326313607]
We propose Video-TransUNet, a deep architecture for segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework.
In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads.
arXiv Detail & Related papers (2022-08-17T14:28:58Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks: the EndoVis18 Challenge and the CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of
Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
- Unsupervised Learning Consensus Model for Dynamic Texture Videos
Segmentation [12.462608802359936]
We present an effective unsupervised learning consensus model (ULCM) for the segmentation of dynamic textures.
In the proposed model, the set of values of the requantized local binary patterns (LBP) histogram around the pixel to be classified is used as features.
Experiments conducted on the challenging SynthDB dataset show that ULCM is significantly faster, simpler, and easier to code, and has a limited number of parameters.
arXiv Detail & Related papers (2020-06-29T16:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.