Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation
- URL: http://arxiv.org/abs/2203.15251v1
- Date: Tue, 29 Mar 2022 05:52:23 GMT
- Title: Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation
- Authors: Yueming Jin, Yang Yu, Cheng Chen, Zixu Zhao, Pheng-Ann Heng, Danail
Stoyanov
- Abstract summary: We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
- Score: 58.74791043631219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic surgical scene segmentation is fundamental for facilitating
cognitive intelligence in the modern operating theatre. Previous works rely on
conventional aggregation modules (e.g., dilated convolution, convolutional
LSTM), which only make use of the local context. In this paper, we propose a
novel framework STswinCL that explores the complementary intra- and inter-video
relations to boost segmentation performance, by progressively capturing the
global context. We first develop a hierarchical Transformer to capture
intra-video relation, which draws richer spatial and temporal cues from
neighboring pixels and previous frames. A joint space-time window shift scheme
is proposed to efficiently aggregate these two cues into each pixel embedding
(minimal sketches of this scheme and of the contrastive objective follow the
abstract).
Then, we explore inter-video relation via pixel-to-pixel contrastive learning,
which structures the global embedding space well. A multi-source contrast
training objective is developed to group pixel embeddings across videos under
ground-truth guidance, which is crucial for learning the global property of the
whole dataset. We extensively validate our approach on two public
surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which
consistently exceeds previous state-of-the-art approaches. Code will be
available at https://github.com/YuemingJin/STswinCL.
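To make the intra-video component concrete, below is a minimal sketch of a joint space-time window shift: a Swin-style cyclic shift applied jointly along the temporal and spatial axes before windowed self-attention. The window sizes, the (B, T, H, W, C) tensor layout, and the helper names are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a joint space-time window shift (a Swin-style cyclic shift
# extended to the temporal axis). Window sizes, tensor layout, and the attention
# stub are illustrative assumptions, not the released STswinCL code.
import torch


def window_partition(x, wt, wh, ww):
    """Split a (B, T, H, W, C) feature volume into non-overlapping 3D windows."""
    B, T, H, W, C = x.shape
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    # -> (B * num_windows, tokens_per_window, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)


def window_reverse(windows, wt, wh, ww, B, T, H, W):
    """Inverse of window_partition."""
    x = windows.view(B, T // wt, H // wh, W // ww, wt, wh, ww, -1)
    return x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, -1)


def joint_spacetime_window_attention(x, attn, wt=2, wh=7, ww=7, shift=False):
    """Windowed self-attention over 3D (time x height x width) windows.
    On shifted blocks the volume is rolled jointly along time and space, so that
    windows straddle the borders of the previous partition and each pixel
    embedding aggregates cues from neighboring pixels and previous frames.
    (The attention mask Swin uses to separate wrapped regions is omitted.)"""
    B, T, H, W, C = x.shape
    if shift:
        x = torch.roll(x, shifts=(-wt // 2, -wh // 2, -ww // 2), dims=(1, 2, 3))
    windows = window_partition(x, wt, wh, ww)
    windows = attn(windows)            # any self-attention module over window tokens
    x = window_reverse(windows, wt, wh, ww, B, T, H, W)
    if shift:
        x = torch.roll(x, shifts=(wt // 2, wh // 2, ww // 2), dims=(1, 2, 3))
    return x
```

Successive blocks would alternate shift=False and shift=True, so the regular and shifted partitions together connect every window to its space-time neighbors.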
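For the inter-video component, the sketch below shows a generic ground-truth-guided pixel-to-pixel contrastive loss: pixel embeddings sampled across the videos in a batch are pulled together when they share a class label and pushed apart otherwise. The sampling strategy, temperature, and function name are assumptions for illustration; the exact multi-source contrast objective is defined in the released code.

```python
# Minimal sketch of a label-guided pixel-to-pixel contrastive loss across videos.
# Sampling strategy, temperature, and names are illustrative assumptions.
import torch
import torch.nn.functional as F


def pixel_contrastive_loss(embeddings, labels, temperature=0.1, max_samples=1024):
    """embeddings: (N, C) pixel features sampled across the videos in a batch.
    labels: (N,) ground-truth class id of each sampled pixel."""
    if embeddings.shape[0] > max_samples:   # cap N to keep the N x N matrix small
        idx = torch.randperm(embeddings.shape[0], device=embeddings.device)[:max_samples]
        embeddings, labels = embeddings[idx], labels[idx]

    z = F.normalize(embeddings, dim=1)
    logits = z @ z.t() / temperature        # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    # log-probability of each non-self pair, normalised over all other pixels
    logits = logits.masked_fill(self_mask, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # average over the positives of each anchor; skip anchors with no positive
    pos_count = pos_mask.sum(dim=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count.clamp(min=1)
    return loss[pos_count > 0].mean()
```

Because the sampled pixels and their labels can come from different videos in the same batch, a single loss groups embeddings across videos, which is the cross-video (global) structuring role the abstract assigns to the multi-source contrast objective.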
Related papers
- Global Motion Understanding in Large-Scale Video Object Segmentation [0.499320937849508]
We show that transferring knowledge from other domains of video understanding combined with large-scale learning can improve robustness of Video Object Segmentation (VOS) under complex circumstances.
Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation.
We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching.
arXiv Detail & Related papers (2024-05-11T15:09:22Z) - DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
We present DVIS++, an improved decoupled framework for universal video segmentation.
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - SOC: Semantic-Assisted Object Cluster for Referring Video Object
Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z) - Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS
Instance Segmentation [10.789826145990016]
This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It incorporates them with a temporal feature blender, which then tokenises the high-level temporal feature to form a strong global feature encoded via a Swin Transformer.
arXiv Detail & Related papers (2023-02-22T12:09:39Z) - Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z) - RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video
Retrieval [66.2075707179047]
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z) - In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z) - Adaptive Intermediate Representations for Video Understanding [50.64187463941215]
We introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding.
We propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task.
We obtain more powerful visual representations for videos which lead to performance gains over the state-of-the-art.
arXiv Detail & Related papers (2021-04-14T21:37:23Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.