Cross-Modal Progressive Comprehension for Referring Segmentation
- URL: http://arxiv.org/abs/2105.07175v1
- Date: Sat, 15 May 2021 08:55:51 GMT
- Title: Cross-Modal Progressive Comprehension for Referring Segmentation
- Authors: Si Liu, Tianrui Hui, Shaofei Huang, Yunchao Wei, Bo Li, Guanbin Li
- Abstract summary: We propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors.
For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression.
For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning.
- Score: 89.58118962086851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a natural language expression and an image/video, the goal of referring
segmentation is to produce the pixel-level masks of the entities described by
the subject of the expression. Previous approaches tackle this problem by
implicit feature interaction and fusion between visual and linguistic
modalities in a one-stage manner. However, humans tend to solve the referring
problem in a progressive manner based on informative words in the expression,
i.e., first roughly locating candidate entities and then distinguishing the
target one. In this paper, we propose a Cross-Modal Progressive Comprehension
(CMPC) scheme to effectively mimic human behaviors and implement it as a CMPC-I
(Image) module and a CMPC-V (Video) module to improve referring image and video
segmentation models. For image data, our CMPC-I module first employs entity and
attribute words to perceive all the related entities that might be considered
by the expression. Then, the relational words are adopted to highlight the
target entity as well as suppress other irrelevant ones by spatial graph
reasoning. For video data, our CMPC-V module further exploits action words
based on CMPC-I to highlight the correct entity matched with the action cues by
temporal graph reasoning. In addition to the CMPC, we also introduce a simple
yet effective Text-Guided Feature Exchange (TGFE) module to integrate the
reasoned multimodal features corresponding to different levels in the visual
backbone under the guidance of textual information. In this way, multi-level
features can communicate with each other and be mutually refined based on the
textual context. Combining CMPC-I or CMPC-V with TGFE forms our image or video
referring segmentation framework, and these frameworks achieve new
state-of-the-art performance on four referring image segmentation benchmarks
and three referring video segmentation benchmarks, respectively.
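As a concrete illustration of the two-stage idea described in the abstract, the sketch below first lets entity/attribute word features attend over spatial regions to surface candidate entities, then uses a pooled relational-word feature to gate a fully connected region graph and propagate information toward the referent. This is a minimal PyTorch sketch under stated assumptions; module names, tensor shapes, and fusion details are illustrative, not the authors' implementation.

```python
# Hedged sketch of the progressive-comprehension idea (not the paper's code):
# stage 1 highlights candidate regions with entity/attribute word features,
# stage 2 reasons over a language-gated spatial region graph.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EntityPerception(nn.Module):
    """Attend over flattened visual regions with entity/attribute word features."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)   # project word features to queries
        self.k = nn.Linear(d, d)   # project region features to keys

    def forward(self, regions, entity_words):
        # regions: (B, N, d) flattened visual features, entity_words: (B, L, d)
        attn = torch.einsum('bld,bnd->bln', self.q(entity_words), self.k(regions))
        attn = attn.softmax(dim=-1)                        # each word attends to regions
        word_ctx = torch.einsum('bln,bnd->bld', attn, regions)
        # fuse the pooled word context back into every region
        return regions + word_ctx.mean(dim=1, keepdim=True)


class RelationalGraphReasoning(nn.Module):
    """Propagate over a fully connected region graph whose node features are
    gated by a pooled relational-word feature before building affinities."""
    def __init__(self, d):
        super().__init__()
        self.edge = nn.Linear(d, d)
        self.update = nn.Linear(d, d)

    def forward(self, regions, relation_words):
        # regions: (B, N, d), relation_words: (B, L, d)
        rel = relation_words.mean(dim=1, keepdim=True)       # (B, 1, d)
        guided = regions * torch.sigmoid(self.edge(rel))     # language-gated nodes
        adj = torch.einsum('bnd,bmd->bnm', guided, guided)   # pairwise affinities
        adj = adj.softmax(dim=-1)
        return regions + F.relu(self.update(torch.bmm(adj, regions)))


if __name__ == '__main__':
    B, N, L, d = 2, 64, 5, 32                  # 8x8 regions, 5 words, toy dims
    regions = torch.randn(B, N, d)
    entity_words, relation_words = torch.randn(B, L, d), torch.randn(B, L, d)
    x = EntityPerception(d)(regions, entity_words)          # stage 1: candidates
    x = RelationalGraphReasoning(d)(x, relation_words)      # stage 2: referent
    print(x.shape)                                          # torch.Size([2, 64, 32])
```

A TGFE-style exchange would apply similar text-guided weighting across the multi-level features of the visual backbone; that part is omitted from this sketch.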
Related papers
- Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information across consecutive frames.
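For context, the snippet below sketches the generic cross-modal self-attention idea referenced in this entry: word tokens and flattened region tokens form one joint sequence over which standard self-attention is applied, and the refined region tokens are read back out. It is a hedged illustration under assumed shapes and names, not this paper's exact module.

```python
# Generic cross-modal self-attention sketch (illustrative, not the CMSA module).
import torch
import torch.nn as nn


class CrossModalSelfAttention(nn.Module):
    def __init__(self, d, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, regions, words):
        # regions: (B, N, d) flattened visual features, words: (B, L, d)
        tokens = torch.cat([regions, words], dim=1)        # joint token sequence
        out, _ = self.attn(tokens, tokens, tokens)         # self-attention over both modalities
        return out[:, :regions.size(1)]                    # refined region tokens


regions, words = torch.randn(2, 64, 32), torch.randn(2, 5, 32)
print(CrossModalSelfAttention(32)(regions, words).shape)   # torch.Size([2, 64, 32])
```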
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that best match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)