Referring Image Segmentation via Cross-Modal Progressive Comprehension
- URL: http://arxiv.org/abs/2010.00514v1
- Date: Thu, 1 Oct 2020 16:02:30 GMT
- Title: Referring Image Segmentation via Cross-Modal Progressive Comprehension
- Authors: Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong
Han, Luoqi Liu, Bo Li
- Abstract summary: Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
- Score: 94.70482302324704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims at segmenting the foreground masks of the
entities that can well match the description given in the natural language
expression. Previous approaches tackle this problem using implicit feature
interaction and fusion between visual and linguistic modalities, but usually
fail to explore informative words of the expression to well align features from
the two modalities for accurately identifying the referred entity. In this
paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a
Text-Guided Feature Exchange (TGFE) module to effectively address the
challenging task. Concretely, the CMPC module first employs entity and
attribute words to perceive all the related entities that might be considered
by the expression. Then, the relational words are adopted to highlight the
correct entity as well as suppress other irrelevant ones by multimodal graph
reasoning. In addition to the CMPC module, we further leverage a simple yet
effective TGFE module to integrate the reasoned multimodal features from
different levels with the guidance of textual information. In this way,
features from multi-levels could communicate with each other and be refined
based on the textual context. We conduct extensive experiments on four popular
referring segmentation benchmarks and achieve new state-of-the-art
performances.
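As a rough illustration of the two modules described in the abstract, below is a minimal PyTorch-style sketch, not the authors' released code: the class names, tensor shapes, the soft entity/attribute and relational word masks, and the gating and graph formulations are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class CMPCSketch(nn.Module):
    """Hypothetical stand-in for the CMPC module: entity/attribute words first
    highlight candidate regions, then relational words drive a dense graph
    step that keeps the referred entity and suppresses irrelevant ones."""

    def __init__(self, vis_dim=512, txt_dim=512, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.edge_proj = nn.Linear(hid_dim, hid_dim)

    def forward(self, vis_feat, word_feat, ent_mask, rel_mask):
        # vis_feat:  (B, N, vis_dim)  flattened spatial features
        # word_feat: (B, L, txt_dim)  per-word features
        # ent_mask / rel_mask: (B, L) soft indicators of entity+attribute words
        # and relational words (assumed to come from a separate word classifier)
        v = self.vis_proj(vis_feat)                          # (B, N, H)
        w = self.txt_proj(word_feat)                         # (B, L, H)

        # Stage 1: perceive all candidate entities named by the expression.
        ent_ctx = (w * ent_mask.unsqueeze(-1)).sum(1)
        ent_ctx = ent_ctx / ent_mask.sum(1, keepdim=True).clamp(min=1e-6)
        cand = v * torch.sigmoid(v * ent_ctx.unsqueeze(1))   # candidate regions

        # Stage 2: relational words define relation-aware queries; a dense
        # graph over regions passes messages so the correct entity is
        # highlighted and the others are suppressed.
        rel_ctx = (w * rel_mask.unsqueeze(-1)).sum(1)
        rel_ctx = rel_ctx / rel_mask.sum(1, keepdim=True).clamp(min=1e-6)
        q = cand * rel_ctx.unsqueeze(1)
        adj = torch.softmax(q @ cand.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return cand + adj @ self.edge_proj(cand)             # (B, N, H)


class TGFESketch(nn.Module):
    """Hypothetical stand-in for the TGFE module: features from several
    backbone levels exchange information through a shared message gated by
    the sentence embedding."""

    def __init__(self, hid_dim=512):
        super().__init__()
        self.gate = nn.Linear(hid_dim, hid_dim)

    def forward(self, level_feats, sent_feat):
        # level_feats: list of (B, N, H) tensors (assumed resized to a common
        # spatial resolution); sent_feat: (B, H) sentence embedding
        shared = torch.stack(level_feats, dim=0).mean(dim=0)
        g = torch.sigmoid(self.gate(sent_feat)).unsqueeze(1)  # (B, 1, H)
        return [f + g * shared for f in level_feats]
```

In a full pipeline, a segmentation head (e.g., upsampling convolutions over the refined features) would follow these modules; backbones, word classification, and losses are omitted from this sketch.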
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Cross-Modal Progressive Comprehension for Referring Segmentation [89.58118962086851]
We propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors.
For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression.
For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning.
arXiv Detail & Related papers (2021-05-15T08:55:51Z)
- Comprehensive Multi-Modal Interactions for Referring Image Segmentation [7.064383217512461]
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description.
To solve RIS efficiently, we need to understand each word's relationship with other words, each region in the image to other regions, and cross-modal alignment between linguistic and visual domains.
We propose a Joint Reasoning Module (JRM) and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task.
arXiv Detail & Related papers (2021-04-21T08:45:09Z)
- Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information in consecutive frames (a rough sketch of the cross-modal self-attention idea is given after this list).
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
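For the cross-modal self-attention entry above, the following is a minimal sketch of the general idea (jointly self-attending over visual and word tokens), not the paper's exact formulation; the use of nn.MultiheadAttention and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossModalSelfAttentionSketch(nn.Module):
    """Hypothetical sketch: project visual and word features into a common
    space, concatenate them into one token sequence, and let self-attention
    model intra- and cross-modal interactions at once."""

    def __init__(self, vis_dim=512, txt_dim=300, hid_dim=256, heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.attn = nn.MultiheadAttention(hid_dim, heads, batch_first=True)

    def forward(self, vis_feat, word_feat):
        # vis_feat:  (B, N, vis_dim)  flattened spatial grid of one frame
        # word_feat: (B, L, txt_dim)  per-word embeddings of the expression
        tokens = torch.cat([self.vis_proj(vis_feat),
                            self.txt_proj(word_feat)], dim=1)  # (B, N+L, H)
        out, _ = self.attn(tokens, tokens, tokens)             # joint self-attention
        return out[:, :vis_feat.size(1)]                       # refined visual tokens
```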
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.