Comprehensive Multi-Modal Interactions for Referring Image Segmentation
- URL: http://arxiv.org/abs/2104.10412v1
- Date: Wed, 21 Apr 2021 08:45:09 GMT
- Title: Comprehensive Multi-Modal Interactions for Referring Image Segmentation
- Authors: Kanishk Jain, Vineet Gandhi
- Abstract summary: We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description.
To solve RIS efficiently, we need to understand each word's relationship with other words, each image region's relationship with other regions, and the cross-modal alignment between the linguistic and visual domains.
We propose a Joint Reasoning (JRM) module and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task.
- Score: 7.064383217512461
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate Referring Image Segmentation (RIS), which outputs a
segmentation map corresponding to the given natural language description. To
solve RIS efficiently, we need to understand each word's relationship with
other words, each image region's relationship with other regions, and the
cross-modal alignment between the linguistic and visual domains. Recent
methods model these
three types of interactions sequentially. We argue that such a modular approach
limits these methods' performance, and joint simultaneous reasoning can help
resolve ambiguities. To this end, we propose a Joint Reasoning (JRM) module and
a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task.
JRM effectively models the referent's multi-modal context by jointly reasoning
over visual and linguistic modalities (performing word-word, image
region-region, word-region interactions in a single module). CMMLF module
further refines the segmentation masks by exchanging contextual information
across visual hierarchy through linguistic features acting as a bridge. We
present thorough ablation studies and validate our approach's performance on
four benchmark datasets, and show that the proposed method outperforms the
existing state-of-the-art methods on all four datasets by significant margins.
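As a rough illustration of the joint reasoning idea in the abstract, the following minimal PyTorch sketch concatenates image-region tokens and word tokens into one sequence and applies a single self-attention pass, so word-word, region-region, and word-region interactions are handled in one module. The class name, dimensions, and projection layers are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of joint multi-modal reasoning in the spirit of the JRM
# module described above. All names and dimensions are illustrative.
import torch
import torch.nn as nn


class JointReasoningSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.vis_proj = nn.Linear(dim, dim)
        self.lang_proj = nn.Linear(dim, dim)
        # One self-attention layer over the joint sequence covers all three
        # interaction types (word-word, region-region, word-region) at once.
        self.joint_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # regions: (B, HW, dim) flattened visual features
        # words:   (B, L, dim) linguistic features
        tokens = torch.cat([self.vis_proj(regions), self.lang_proj(words)], dim=1)
        attended, _ = self.joint_attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)
        # Return only the refined visual tokens for mask prediction.
        return tokens[:, : regions.size(1)]


if __name__ == "__main__":
    jrm = JointReasoningSketch()
    vis = torch.randn(2, 26 * 26, 256)   # e.g. a 26x26 feature map, flattened
    txt = torch.randn(2, 15, 256)        # a 15-word expression
    print(jrm(vis, txt).shape)           # torch.Size([2, 676, 256])
```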
Related papers
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - RISAM: Referring Image Segmentation via Mutual-Aware Attention Features [13.64992652002458]
Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt.
Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding.
We propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism.
arXiv Detail & Related papers (2023-11-27T11:24:25Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking (MEL) aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluate our method on three recent multimodal datasets, demonstrating its effectiveness in producing high-quality multimodal summaries.
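As a hedged sketch of the optimal-transport alignment this entry alludes to, the following Sinkhorn iteration produces a soft matching between visual and textual segment features; the cosine-based cost, entropy weight, and iteration count are assumptions for illustration, not the paper's exact formulation.

```python
# Sinkhorn-Knopp iteration yielding a soft transport plan between visual and
# textual segment features. Cost, eps, and iteration count are assumptions.
import torch
import torch.nn.functional as F


def sinkhorn_alignment(vis: torch.Tensor, txt: torch.Tensor,
                       eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    # vis: (N, d) visual segment features, txt: (M, d) textual segment features
    vis = F.normalize(vis, dim=-1)
    txt = F.normalize(txt, dim=-1)
    cost = 1.0 - vis @ txt.T                      # (N, M) transport cost
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.ones(vis.size(0)) / vis.size(0)     # uniform source marginal
    v = torch.ones(txt.size(0)) / txt.size(0)     # uniform target marginal
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(iters):                        # alternating scaling updates
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)    # transport plan, shape (N, M)
```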
arXiv Detail & Related papers (2022-10-10T14:27:10Z) - CMF: Cascaded Multi-model Fusion for Referring Image Segmentation [24.942658173937563]
We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression.
We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel.
Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
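The parallel atrous convolutions mentioned above can be sketched as follows; the dilation rates, channel width, and summation-based fusion are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of parallel atrous (dilated) convolutions over a fused
# multi-modal feature map; all hyperparameters are assumptions.
import torch
import torch.nn as nn


class ParallelAtrousFusion(nn.Module):
    def __init__(self, channels: int = 256, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One 3x3 convolution per dilation rate, applied in parallel so each
        # branch sees a different receptive field over the fused feature map.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.merge = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (B, C, H, W) multi-modal feature map (visual + tiled language)
        out = sum(branch(fused) for branch in self.branches)
        return self.merge(torch.relu(out))
```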
arXiv Detail & Related papers (2021-06-16T08:18:39Z) - Cross-Modal Progressive Comprehension for Referring Segmentation [89.58118962086851]
We propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors.
For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression.
For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning.
arXiv Detail & Related papers (2021-05-15T08:55:51Z) - Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features (a rough sketch follows this entry).
A cross-frame self-attention (CFSA) module integrates temporal information across consecutive frames.
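Below is a minimal sketch of gated multi-level fusion in the spirit of the GMLF module above, assuming per-level sigmoid gates over cross-modal features that have already been resized to a common resolution; the gating form and number of levels are assumptions, not the paper's design.

```python
# Hypothetical gated multi-level fusion: per-level sigmoid gates weight
# cross-modal features from several backbone stages before summation.
from typing import List

import torch
import torch.nn as nn


class GatedMultiLevelFusionSketch(nn.Module):
    def __init__(self, channels: int = 256, num_levels: int = 3):
        super().__init__()
        # One 1x1 convolution per visual level predicts that level's gate.
        self.gates = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_levels)
        )

    def forward(self, level_feats: List[torch.Tensor]) -> torch.Tensor:
        # level_feats: list of (B, C, H, W) cross-modal features, one per level,
        # assumed already resized to a common spatial resolution.
        fused = 0
        for feat, gate in zip(level_feats, self.gates):
            fused = fused + torch.sigmoid(gate(feat)) * feat
        return fused
```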
arXiv Detail & Related papers (2021-02-09T11:27:59Z) - Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
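The "gather-propagate-distribute" scheme can be illustrated with the following sketch, where a given adjacency matrix stands in for the dependency parsing tree word graph; the attention-based gather/distribute steps and all shapes are assumptions for illustration, not the LSCM implementation.

```python
# Illustrative gather-propagate-distribute step over a word graph: pixel
# features are gathered onto word nodes, propagated along graph edges, and
# distributed back to pixels. All shapes and operations are assumptions.
import torch


def gather_propagate_distribute(vis: torch.Tensor, words: torch.Tensor,
                                adj: torch.Tensor) -> torch.Tensor:
    # vis: (HW, d) pixel features, words: (L, d) word features, adj: (L, L)
    attn = torch.softmax(words @ vis.T, dim=-1)           # (L, HW) word-to-pixel attention
    node_states = attn @ vis                              # gather: (L, d) node features
    norm_adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    node_states = norm_adj @ node_states                  # propagate along the word graph
    return attn.T @ node_states                           # distribute back: (HW, d)
```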
arXiv Detail & Related papers (2020-10-01T16:03:51Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that best match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.