Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation
- URL: http://arxiv.org/abs/2203.15969v1
- Date: Wed, 30 Mar 2022 01:06:13 GMT
- Title: Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation
- Authors: Guang Feng, Lihe Zhang, Zhiwei Hu, Huchuan Lu
- Abstract summary: We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
- Score: 87.49579477873196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features, and we insert a vision-language mutual guidance (VLMG) module into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes the multi-granularity linguistic context into account and, with the help of the VLMG module, realizes deep interleaving between the modalities. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module that strengthens temporal coherence: it uses language-guided spatial-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively. Extensive experiments on four datasets verify the effectiveness of the proposed model.
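The paper itself ships no code here, but the two mechanisms described in the abstract can be illustrated. Below is a minimal PyTorch sketch of a VLMG-style mutual guidance block, assuming a simple bidirectional cross-attention between one CNN stage's feature map and the transformer's word features; the module names, dimensions, single-head attention, and residual updates are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MutualGuidance(nn.Module):
    """VLMG-style sketch: vision and language features attend to each
    other, and the refined features are returned to their respective
    encoder streams, so the next stage sees language-aware visual
    features and vision-aware linguistic features."""

    def __init__(self, vis_dim: int, lang_dim: int, embed_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, embed_dim, kernel_size=1)
        self.lang_proj = nn.Linear(lang_dim, embed_dim)
        self.v2l = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
        self.l2v = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
        self.vis_out = nn.Conv2d(embed_dim, vis_dim, kernel_size=1)
        self.lang_out = nn.Linear(embed_dim, lang_dim)

    def forward(self, vis, lang):
        # vis: (B, Cv, H, W) stage feature map; lang: (B, L, Cl) word features
        b, _, h, w = vis.shape
        v = self.vis_proj(vis).flatten(2).transpose(1, 2)  # (B, HW, D)
        l = self.lang_proj(lang)                           # (B, L, D)
        # words attend to pixels, pixels attend to words
        l_ref, _ = self.v2l(query=l, key=v, value=v)
        v_ref, _ = self.l2v(query=v, key=l, value=l)
        v_ref = v_ref.transpose(1, 2).reshape(b, -1, h, w)
        # residual updates keep each stream's original information
        return vis + self.vis_out(v_ref), lang + self.lang_out(l_ref)
```

Inserting such a block after several encoder stages gives the deep interleaving the abstract describes. The LMDF idea can be sketched in the same spirit: a language-guided spatial-temporal feature predicts a k x k filter at every spatial position, which is applied to the current frame's feature via unfold. The sketch below is single-scale and shares each local filter across channels for brevity; the paper's module is multi-scale and its exact parameterization may differ.

```python
class DynamicFiltering(nn.Module):
    """LMDF-style sketch: position-specific dynamic filters generated
    from a guidance feature update the current frame's feature."""

    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.filter_gen = nn.Conv2d(dim, k * k, kernel_size=1)
        self.unfold = nn.Unfold(kernel_size=k, padding=k // 2)

    def forward(self, cur, guide):
        # cur:   (B, C, H, W) current-frame feature
        # guide: (B, C, H, W) language-guided spatial-temporal feature
        b, c, h, w = cur.shape
        filters = torch.softmax(self.filter_gen(guide), dim=1)  # (B, k*k, H, W)
        patches = self.unfold(cur).view(b, c, self.k * self.k, h * w)
        filters = filters.view(b, 1, self.k * self.k, h * w)
        out = (patches * filters).sum(dim=2).view(b, c, h, w)
        return cur + out  # residual update of the current frame
```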
Related papers
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach that explores multimodal representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation [44.952526831843386]
We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-language interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
arXiv Detail & Related papers (2023-07-02T10:29:35Z)
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time in a unified framework, MUTR adopts a DETR-style transformer and can segment video objects designated by either a text or an audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer [12.544216587327387]
We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions (see the label-propagation sketch after this list).
arXiv Detail & Related papers (2023-04-12T15:50:19Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
arXiv Detail & Related papers (2021-05-14T13:27:53Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features (see the co-attention sketch after this list).
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
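For MED-VT's transductive many-to-many label propagation (referenced in its entry above), here is a minimal sketch of the general idea: class scores at every location of every frame are refined by an affinity-weighted average over all locations of all frames. The cosine-similarity affinity and the temperature `tau` are assumptions rather than MED-VT's exact formulation, and the dense (T*N) x (T*N) affinity is only practical for short clips.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feats: torch.Tensor, logits: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """Many-to-many label propagation sketch.
    feats:  (T, C, N) per-frame features (N spatial locations per frame)
    logits: (T, K, N) per-frame class scores
    Returns temporally smoothed class scores of shape (T, K, N)."""
    t, c, n = feats.shape
    f = F.normalize(feats.permute(0, 2, 1).reshape(t * n, c), dim=1)
    affinity = torch.softmax(f @ f.t() / tau, dim=1)  # (T*N, T*N) row-stochastic
    y = logits.permute(0, 2, 1).reshape(t * n, -1)    # (T*N, K)
    y_prop = affinity @ y                             # mix scores across all frames
    return y_prop.reshape(t, n, -1).permute(0, 2, 1)
```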
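For EFN's co-attention embedding (referenced in its entry above), the parallel update of both modalities can be illustrated with a textbook co-attention: one affinity matrix between visual positions and words is normalized along each axis, so vision and language are updated simultaneously from the same correlation. The projections and residual form are illustrative assumptions, not EFN's exact formulation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Generic co-attention sketch: a single vision-word affinity
    drives attention in both directions, updating the two feature
    sets in parallel."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_vis = nn.Linear(dim, dim)
        self.q_lang = nn.Linear(dim, dim)

    def forward(self, vis, lang):
        # vis: (B, HW, C) flattened visual features; lang: (B, L, C) word features
        affinity = torch.bmm(self.q_vis(vis),
                             self.q_lang(lang).transpose(1, 2))  # (B, HW, L)
        vis_new = vis + torch.softmax(affinity, dim=2) @ lang    # words -> pixels
        lang_new = lang + torch.softmax(affinity, dim=1).transpose(1, 2) @ vis  # pixels -> words
        return vis_new, lang_new
```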