Referring Segmentation in Images and Videos with Cross-Modal
Self-Attention Network
- URL: http://arxiv.org/abs/2102.04762v1
- Date: Tue, 9 Feb 2021 11:27:59 GMT
- Title: Referring Segmentation in Images and Videos with Cross-Modal
Self-Attention Network
- Authors: Linwei Ye, Mrigank Rochan, Zhi Liu, Xiaoqin Zhang and Yang Wang
- Abstract summary: A cross-modal self-attention (CMSA) module utilizes fine
details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive
cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal
information in consecutive frames.
- Score: 27.792054915363106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of referring segmentation in images and videos with
natural language. Given an input image (or video) and a referring expression,
the goal is to segment the entity referred to by the expression in the image or
video. In this paper, we propose a cross-modal self-attention (CMSA) module to
utilize fine details of individual words and the input image or video, which
effectively captures the long-range dependencies between linguistic and visual
features. Our model can adaptively focus on informative words in the referring
expression and important regions in the visual input. We further propose a
gated multi-level fusion (GMLF) module to selectively integrate self-attentive
cross-modal features corresponding to different levels of visual features. This
module controls the information flow of features at different levels, using
high-level and low-level semantic information related to different attentive
words. In addition, we introduce a cross-frame self-attention (CFSA) module to
effectively integrate temporal information from consecutive frames, which
extends our method to referring segmentation in videos. Experiments on four
referring image segmentation benchmark datasets and two actor and action video
segmentation datasets consistently demonstrate that our proposed approach
outperforms existing state-of-the-art methods.
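To make the three modules described in the abstract more concrete, below is a
minimal PyTorch-style sketch of cross-modal self-attention, gated multi-level
fusion, and cross-frame self-attention. All class names
(`CrossModalSelfAttention`, `GatedMultiLevelFusion`, `CrossFrameSelfAttention`),
tensor shapes, the word-pooling step, and the gating form are illustrative
assumptions and do not reproduce the authors' implementation.

```python
# Minimal illustrative sketch (assumptions throughout): module names, tensor
# shapes, word pooling, and gating choices are not taken from the paper's
# released code.
import torch
import torch.nn as nn


class CrossModalSelfAttention(nn.Module):
    """Pair every spatial position with every word, then self-attend over all
    pixel-word tokens to capture long-range vision-language dependencies."""

    def __init__(self, vis_dim, lang_dim, embed_dim, num_heads=8):
        super().__init__()
        self.joint_proj = nn.Linear(vis_dim + lang_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, Cv, H, W), lang_feat: (B, T, Cl)
        B, Cv, H, W = vis_feat.shape
        T = lang_feat.size(1)
        pix = vis_feat.flatten(2).transpose(1, 2)                 # (B, HW, Cv)
        pix = pix.unsqueeze(2).expand(B, H * W, T, Cv)            # (B, HW, T, Cv)
        words = lang_feat.unsqueeze(1).expand(B, H * W, T, -1)    # (B, HW, T, Cl)
        joint = self.joint_proj(torch.cat([pix, words], dim=-1))  # (B, HW, T, D)
        joint = joint.flatten(1, 2)                               # (B, HW*T, D)
        attended, _ = self.attn(joint, joint, joint)              # self-attention
        attended = attended.view(B, H * W, T, -1).mean(dim=2)     # pool over words
        return self.out_proj(attended).transpose(1, 2).reshape(B, -1, H, W)


class GatedMultiLevelFusion(nn.Module):
    """Selectively combine cross-modal features from different visual levels,
    with learned sigmoid gates controlling the information flow."""

    def __init__(self, embed_dim, num_levels):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Conv2d(embed_dim, embed_dim, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, level_feats):
        # level_feats: list of (B, D, H, W) tensors already resized to a
        # common spatial resolution.
        fused = torch.zeros_like(level_feats[0])
        for feat, gate in zip(level_feats, self.gates):
            fused = fused + torch.sigmoid(gate(feat)) * feat
        return fused


class CrossFrameSelfAttention(nn.Module):
    """Let every spatial position attend to all positions in a short window of
    consecutive frames to integrate temporal information."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, F, C, H, W) features of F consecutive frames
        B, F, C, H, W = frame_feats.shape
        tokens = frame_feats.flatten(3).permute(0, 1, 3, 2).reshape(B, F * H * W, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        return attended.reshape(B, F, H * W, C).permute(0, 1, 3, 2).reshape(B, F, C, H, W)


if __name__ == "__main__":
    # Toy shapes only: three visual levels at 10x10, a 12-word expression.
    vis_levels = [torch.randn(2, 256, 10, 10) for _ in range(3)]
    words = torch.randn(2, 12, 300)
    cmsa = CrossModalSelfAttention(vis_dim=256, lang_dim=300, embed_dim=256)
    gmlf = GatedMultiLevelFusion(embed_dim=256, num_levels=3)
    fused = gmlf([cmsa(v, words) for v in vis_levels])            # (2, 256, 10, 10)
```

Note that pairing every pixel with every word yields H*W*T tokens, so a
faithful implementation would likely operate on downsampled feature maps and
may also concatenate spatial coordinate features; both are omitted here for
brevity.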
Related papers
- CM-PIE: Cross-modal perception for interactive-enhanced audio-visual
video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive change of information across the video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z)
- Cross-Modal Progressive Comprehension for Referring Segmentation [89.58118962086851]
We propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors.
For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression.
For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning.
arXiv Detail & Related papers (2021-05-15T08:55:51Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that best match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)