Related papers: Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

URL: http://arxiv.org/abs/2204.02547v1
Date: Wed, 6 Apr 2022 02:42:33 GMT
Title: Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Authors: Wangbo Zhao, Kai Wang, Xiangxiang Chu, Fuzhao Xue, Xinchao Wang, Yang You
Abstract summary: Text-based video segmentation aims to segment the target object in a video based on a describing sentence. We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
Score: 56.41614987789537
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-based video segmentation aims to segment the target object in a video based on a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer, which can fuse and aggregate multi-modal and temporal features between frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features in each feature level with guidance from linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and the generalization ability of our method compared to the state-of-the-art methods.

Related papers

GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval [12.483734449829235]
GAID is a framework that integrates audio and visual features under textual guidance.<n>DASP injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference.<n>Experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results with notable efficiency gains.
arXiv Detail & Related papers (2025-08-03T10:44:24Z)
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z)
Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation [44.952526831843386]
We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS. Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder. A vision-ferring interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
arXiv Detail & Related papers (2023-07-02T10:29:35Z)
Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically. A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features. In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network. A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features. The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
Cross-modal self-attention (CMSA) module to utilize fine details of individual words and the input image or video. gated multi-level fusion (GMLF) module to selectively integrate self-attentive cross-modal features. Cross-frame self-attention (CFSA) module to effectively integrate temporal information in consecutive frames.
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities. We propose a Cross-Modal Progressive (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos. Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.