Siamese Network with Interactive Transformer for Video Object
Segmentation
- URL: http://arxiv.org/abs/2112.13983v1
- Date: Tue, 28 Dec 2021 03:38:17 GMT
- Title: Siamese Network with Interactive Transformer for Video Object
Segmentation
- Authors: Meng Lan, Jing Zhang, Fengxiang He, Lefei Zhang
- Abstract summary: We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ the Siamese architecture to extract backbone features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
- Score: 34.202137199782804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised video object segmentation (VOS) refers to segmenting the
target object in the remaining frames of a video given its annotation in the
first frame, and has been actively studied in recent years. The key challenge
lies in finding effective ways to exploit the spatio-temporal context of past
frames to help learn a discriminative target representation for the current
frame. In this
paper, we propose a novel Siamese network with a specifically designed
interactive transformer, called SITVOS, to enable effective context propagation
from historical to current frames. Technically, we use the transformer encoder
and decoder to handle the past frames and current frame separately, i.e., the
encoder encodes robust spatio-temporal context of target object from the past
frames, while the decoder takes the feature embedding of the current frame as the
query to retrieve the target from the encoder output. To further enhance the
target representation, a feature interaction module (FIM) is devised to promote
the information flow between the encoder and decoder. Moreover, we employ the
Siamese architecture to extract backbone features of both past and current
frames, which enables feature reuse and is more efficient than existing
methods. Experimental results on three challenging benchmarks validate the
superiority of SITVOS over state-of-the-art methods.
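The retrieval-style design described above maps naturally onto standard transformer
modules. Below is a minimal PyTorch sketch of that pattern, not the authors'
implementation: the module name, shapes, and backbone stand-in are hypothetical,
and the feature interaction module (FIM) is omitted for brevity.

```python
import torch
import torch.nn as nn

class QueryRetrievalVOS(nn.Module):
    """Hypothetical sketch of the encoder-decoder idea: the encoder builds
    spatio-temporal context from past-frame tokens, and the current frame's
    tokens act as decoder queries that retrieve the target from that context."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.decoder = nn.TransformerDecoder(dec, num_layers=layers)

    def forward(self, past_tokens, cur_tokens):
        # past_tokens: (B, T*H*W, C) features of past frames; cur_tokens: (B, H*W, C)
        memory = self.encoder(past_tokens)       # spatio-temporal target context
        return self.decoder(cur_tokens, memory)  # current frame queries the memory

# A single shared ("Siamese") backbone embeds past and current frames alike,
# which is the feature-reuse idea from the abstract.
backbone = nn.Linear(512, 256)                    # stand-in for a shared CNN backbone
model = QueryRetrievalVOS()
past = backbone(torch.randn(1, 3 * 24 * 24, 512)) # 3 past frames on a 24x24 grid
cur = backbone(torch.randn(1, 24 * 24, 512))
target_feats = model(past, cur)                   # (1, 576, 256), fed to a mask head
```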
Related papers
- Concatenated Masked Autoencoders as Spatial-Temporal Learner [6.475592804311682]
We introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.
We propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets.
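The reversed-target idea is simple to state in code; a minimal sketch, assuming
clips stored as (B, T, C, H, W) tensors (the helper name is hypothetical):

```python
import torch

def vire_target(clip: torch.Tensor) -> torch.Tensor:
    """Reverse a clip along time so the reversed frames serve as the
    reconstruction target (hypothetical helper, not the authors' code)."""
    # clip: (B, T, C, H, W) -> flipped along the T axis
    return torch.flip(clip, dims=[1])

clip = torch.randn(2, 8, 3, 224, 224)    # batch of two 8-frame clips
target = vire_target(clip)               # masked-autoencoder target
assert torch.equal(target[:, 0], clip[:, -1])
```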
arXiv Detail & Related papers (2023-11-02T03:08:26Z)
- Redundancy-aware Transformer for Video Question Answering [71.98116071679065]
We propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner.
To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames.
As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
arXiv Detail & Related papers (2023-08-07T03:16:24Z)
- Online Video Instance Segmentation via Robust Context Fusion [36.376900904288966]
Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences.
Recent transformer-based neural networks have demonstrated a powerful capability for modeling the VIS task.
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
arXiv Detail & Related papers (2022-07-12T15:04:50Z)
- Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
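One way to picture the two-encoder split (a hypothetical sketch, not the paper's
architecture): a 3D convolution stands in for the temporal encoder over the clip,
a 2D convolution for the spatial encoder over the target frame, and their
features are fused for a mask decoder.

```python
import torch
import torch.nn as nn

class CollabSTEncoder(nn.Module):
    """Hypothetical sketch: 3D temporal encoder over the clip (action cues)
    plus 2D spatial encoder over the target frame (actor details)."""

    def __init__(self, c=64):
        super().__init__()
        self.temporal = nn.Conv3d(3, c, kernel_size=3, padding=1)  # over (T, H, W)
        self.spatial = nn.Conv2d(3, c, kernel_size=3, padding=1)   # over target frame

    def forward(self, clip, frame):
        # clip: (B, 3, T, H, W); frame: (B, 3, H, W)
        t = self.temporal(clip).mean(dim=2)  # pool over time -> (B, c, H, W)
        s = self.spatial(frame)              # (B, c, H, W)
        return torch.cat([t, s], dim=1)      # joint features for a mask decoder

model = CollabSTEncoder()
feats = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 3, 64, 64))
```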
arXiv Detail & Related papers (2021-05-14T13:27:53Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
To further improve the architecture, we introduce a weight function that re-balances classes, increasing the networks' attention to under-represented objects.
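The re-balancing idea can be made concrete with inverse-frequency class weights;
a minimal sketch using weighted cross-entropy (the paper's exact weight function
may differ):

```python
import torch
import torch.nn.functional as F

def class_balanced_loss(logits, labels, num_classes):
    """Weight rare classes more heavily so under-represented objects
    contribute more to the loss (one common choice of weighting)."""
    counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts.clamp(min=1.0))
    return F.cross_entropy(logits, labels, weight=weights)

# logits: (B, K, H, W) per-pixel class scores; labels: (B, H, W) class ids
logits = torch.randn(2, 5, 32, 32)
labels = torch.randint(0, 5, (2, 32, 32))
loss = class_balanced_loss(logits, labels, num_classes=5)
```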
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.