Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for
Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2307.00536v2
- Date: Sun, 17 Sep 2023 09:01:52 GMT
- Title: Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for
Referring Video Object Segmentation
- Authors: Meng Lan, Fu Rong, Zuchao Li, Wei Yu, Lefei Zhang
- Abstract summary: We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-ferring interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
- Score: 44.952526831843386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS) aims to segment the target object
in a video sequence described by a language expression. Typical multimodal
Transformer-based RVOS approaches process the video sequence in a frame-independent
manner to reduce the high computational cost, which, however, restricts the
performance due to the lack of inter-frame interaction for temporal coherence
modeling and spatio-temporal representation learning of the referred object.
Besides, the absence of sufficient cross-modal interactions results in weak
correlation between the visual and linguistic features, which increases the
difficulty of decoding the target information and limits the performance of the
model. In this paper, we propose a bidirectional correlation-driven inter-frame
interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight and plug-and-play inter-frame interaction
module in the Transformer decoder to efficiently learn the spatio-temporal
features of the referred object, so as to decode the object information in the
video sequence more precisely and generate more accurate segmentation results.
Moreover, a bidirectional vision-language interaction module is implemented
before the multimodal Transformer to enhance the correlation between the visual
and linguistic features, thus facilitating the language queries to decode more
precise object information from visual features and ultimately improving the
segmentation performance. Extensive experimental results on four benchmarks
validate the superiority of our BIFIT over state-of-the-art methods and the
effectiveness of our proposed modules.
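The abstract describes two architectural ideas: bidirectional cross-modal attention applied before the multimodal Transformer, and temporal interaction among the per-frame decoder queries. The following is a minimal PyTorch-style sketch of both ideas; the class names, tensor shapes, and default dimensions are assumptions made for illustration and do not reproduce the authors' implementation.

```python
# Illustrative sketch only: the official BIFIT code is not reproduced here, so the
# module names, dimensions, and wiring below are assumptions inferred from the
# abstract, not the authors' implementation.
import torch
import torch.nn as nn


class BidirectionalVLInteraction(nn.Module):
    """Cross-attend vision -> language and language -> vision before the
    multimodal Transformer, so the two feature sets become mutually correlated."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, vis, lang):
        # vis: (B, HW, C) flattened per-frame visual features
        # lang: (B, L, C) linguistic token features
        vis_upd, _ = self.lang_to_vis(query=vis, key=lang, value=lang)
        lang_upd, _ = self.vis_to_lang(query=lang, key=vis, value=vis)
        return self.norm_v(vis + vis_upd), self.norm_l(lang + lang_upd)


class InterFrameInteraction(nn.Module):
    """Lightweight, plug-and-play temporal self-attention over the per-frame
    decoder queries, giving the decoder access to spatio-temporal context."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries):
        # queries: (B, T, Q, C) language queries for each of T frames
        b, t, q, c = queries.shape
        x = queries.permute(0, 2, 1, 3).reshape(b * q, t, c)  # attend over time
        x_upd, _ = self.temporal_attn(x, x, x)
        x = self.norm(x + x_upd)
        return x.reshape(b, q, t, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    vl = BidirectionalVLInteraction()
    ifi = InterFrameInteraction()
    vis, lang = torch.randn(2, 196, 256), torch.randn(2, 12, 256)
    vis, lang = vl(vis, lang)
    out = ifi(torch.randn(2, 5, 10, 256))
    print(vis.shape, lang.shape, out.shape)  # sanity check of tensor shapes
```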
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Efficient Unsupervised Video Object Segmentation Network Based on Motion Guidance [1.5736899098702974]
This paper proposes a video object segmentation network based on motion guidance.
The model comprises a dual-stream network, motion guidance module, and multi-scale progressive fusion module.
The experimental results prove the superior performance of the proposed method.
arXiv Detail & Related papers (2022-11-10T06:13:23Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation [24.884078497381633]
We introduce a Transformer-based approach to video object segmentation (VOS).
Our attention-based approach allows a model to learn to attend over the features of a history of multiple frames.
Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness compared with the state of the art.
arXiv Detail & Related papers (2021-01-21T20:06:12Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)