Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for
Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2307.00536v2
- Date: Sun, 17 Sep 2023 09:01:52 GMT
- Title: Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for
Referring Video Object Segmentation
- Authors: Meng Lan, Fu Rong, Zuchao Li, Wei Yu, Lefei Zhang
- Abstract summary: We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-ferring interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
- Score: 44.952526831843386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS) aims to segment the target object
in a video sequence described by a language expression. Typical multimodal
Transformer-based RVOS approaches process the video sequence in a frame-independent
manner to reduce the high computational cost, which, however, restricts the
performance due to the lack of inter-frame interaction for temporal coherence
modeling and spatio-temporal representation learning of the referred object.
Besides, the absence of sufficient cross-modal interactions results in weak
correlation between the visual and linguistic features, which increases the
difficulty of decoding the target information and limits the performance of the
model. In this paper, we propose a bidirectional correlation-driven inter-frame
interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight and plug-and-play inter-frame interaction
module in the Transformer decoder to efficiently learn the spatio-temporal
features of the referred object, so as to decode the object information in the
video sequence more precisely and generate more accurate segmentation results.
Moreover, a bidirectional vision-language interaction module is implemented
before the multimodal Transformer to enhance the correlation between the visual
and linguistic features, thus facilitating the language queries to decode more
precise object information from visual features and ultimately improving the
segmentation performance. Extensive experimental results on four benchmarks
validate the superiority of our BIFIT over state-of-the-art methods and the
effectiveness of our proposed modules.
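The abstract describes two architectural ideas: bidirectional cross-modal attention applied before the multimodal Transformer, and temporal interaction among the per-frame decoder queries. The following is a minimal PyTorch-style sketch of both ideas; the class names, tensor shapes, and default dimensions are assumptions made for illustration and do not reproduce the authors' implementation.

```python
# Illustrative sketch only: the official BIFIT code is not reproduced here, so the
# module names, dimensions, and wiring below are assumptions inferred from the
# abstract, not the authors' implementation.
import torch
import torch.nn as nn


class BidirectionalVLInteraction(nn.Module):
    """Cross-attend vision -> language and language -> vision before the
    multimodal Transformer, so the two feature sets become mutually correlated."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, vis, lang):
        # vis: (B, HW, C) flattened per-frame visual features
        # lang: (B, L, C) linguistic token features
        vis_upd, _ = self.lang_to_vis(query=vis, key=lang, value=lang)
        lang_upd, _ = self.vis_to_lang(query=lang, key=vis, value=vis)
        return self.norm_v(vis + vis_upd), self.norm_l(lang + lang_upd)


class InterFrameInteraction(nn.Module):
    """Lightweight, plug-and-play temporal self-attention over the per-frame
    decoder queries, giving the decoder access to spatio-temporal context."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries):
        # queries: (B, T, Q, C) language queries for each of T frames
        b, t, q, c = queries.shape
        x = queries.permute(0, 2, 1, 3).reshape(b * q, t, c)  # attend over time
        x_upd, _ = self.temporal_attn(x, x, x)
        x = self.norm(x + x_upd)
        return x.reshape(b, q, t, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    vl = BidirectionalVLInteraction()
    ifi = InterFrameInteraction()
    vis, lang = torch.randn(2, 196, 256), torch.randn(2, 12, 256)
    vis, lang = vl(vis, lang)
    out = ifi(torch.randn(2, 5, 10, 256))
    print(vis.shape, lang.shape, out.shape)  # sanity check of tensor shapes
```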
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Efficient Unsupervised Video Object Segmentation Network Based on Motion Guidance [1.5736899098702974]
This paper proposes a video object segmentation network based on motion guidance.
The model comprises a dual-stream network, motion guidance module, and multi-scale progressive fusion module.
The experimental results prove the superior performance of the proposed method.
arXiv Detail & Related papers (2022-11-10T06:13:23Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation [24.884078497381633]
We introduce a Transformer-based approach to video object segmentation (VOS).
Our attention-based approach allows a model to learn to attend over the features of a history of multiple frames.
Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness compared with the state of the art.
arXiv Detail & Related papers (2021-01-21T20:06:12Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)