Multi-Modal Mutual Attention and Iterative Interaction for Referring
Image Segmentation
- URL: http://arxiv.org/abs/2305.15302v1
- Date: Wed, 24 May 2023 16:26:05 GMT
- Title: Multi-Modal Mutual Attention and Iterative Interaction for Referring
Image Segmentation
- Authors: Chang Liu, Henghui Ding, Yulun Zhang, Xudong Jiang
- Abstract summary: We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression.
We propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities.
- Score: 49.6153714376745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of referring image segmentation that aims to generate
a mask for the object specified by a natural language expression. Many recent
works utilize Transformer to extract features for the target object by
aggregating the attended visual regions. However, the generic attention
mechanism in Transformer only uses the language input for attention weight
calculation, which does not explicitly fuse language features in its output.
Thus, its output feature is dominated by vision information, which limits the
model's ability to comprehensively understand the multi-modal information and brings
uncertainty to the subsequent mask decoder when extracting the output mask. To
address this issue, we propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$)
and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information
from the two input modalities. Based on {$\mathrm{M^3Dec}$}, we further propose
Iterative Multi-modal Interaction ($\mathrm{IMI}$) to allow continuous and
in-depth interactions between language and vision features. Furthermore, we
introduce Language Feature Reconstruction ($\mathrm{LFR}$) to prevent the
language information from being lost or distorted in the extracted feature.
Extensive experiments show that our proposed approach significantly improves
the baseline and consistently outperforms state-of-the-art referring image
segmentation methods on the RefCOCO series of datasets.
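To make the contrast with generic cross-attention concrete, below is a minimal, hypothetical PyTorch sketch in the spirit of $\mathrm{M^3Att}$, assuming flattened visual tokens and word-level language features of a shared dimension. All module and variable names are illustrative assumptions and do not reproduce the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MutualAttentionSketch(nn.Module):
    """Illustrative mutual attention between vision and language tokens.

    A single vision-language affinity matrix is shared in both directions:
    softmax over words pulls language features into the vision stream, and
    softmax over visual tokens (on the transposed affinity) pulls visual
    features into the language stream, so neither modality dominates the
    fused output. All names and shapes here are illustrative assumptions.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_vis = nn.Linear(dim, dim)
        self.k_lang = nn.Linear(dim, dim)
        self.v_vis = nn.Linear(dim, dim)
        self.v_lang = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)
        self.scale = dim ** -0.5

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis: (B, N_v, C) flattened visual tokens; lang: (B, N_l, C) word features
        affinity = self.q_vis(vis) @ self.k_lang(lang).transpose(1, 2) * self.scale  # (B, N_v, N_l)

        # Vision stream: each visual token aggregates the relevant word features.
        vis_out = vis + affinity.softmax(dim=-1) @ self.v_lang(lang)                  # (B, N_v, C)
        # Language stream: each word aggregates the relevant visual features.
        lang_out = lang + affinity.transpose(1, 2).softmax(dim=-1) @ self.v_vis(vis)  # (B, N_l, C)

        # Fuse both updated streams per visual token so the output explicitly
        # carries language information rather than being vision-dominated.
        lang_ctx = affinity.softmax(dim=-1) @ lang_out                                # (B, N_v, C)
        fused = self.fuse(torch.cat([vis_out, lang_ctx], dim=-1))                     # (B, N_v, C)
        return fused, lang_out
```

Stacking several such blocks and feeding the updated language stream back in at each step would loosely correspond to the iterative multi-modal interaction ($\mathrm{IMI}$) described above, and a reconstruction loss pulling `lang_out` back toward the original word features would play the role of $\mathrm{LFR}$; both readings are assumptions for illustration rather than the paper's exact design.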
Related papers
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z)
- OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling [80.85164509232261]
We propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer.
To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM).
Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region.
arXiv Detail & Related papers (2024-10-10T15:18:19Z)
- Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level; a rough sketch of this idea follows this entry.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
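As a loose illustration of the bidirectional token-masking idea summarized above, the hypothetical PyTorch sketch below masks a random subset of tokens in each modality and reconstructs the missing ones from a joint encoding of both modalities. The module names, masking ratio, and reconstruction loss are assumptions for illustration, not the BTMAE authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalTokenMaskingSketch(nn.Module):
    """Mask tokens in both modalities and reconstruct each from the joint context."""

    def __init__(self, dim: int = 256, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.recon_head = nn.Linear(dim, dim)

    def _mask(self, tokens: torch.Tensor):
        # Randomly replace a fraction of tokens with a learned mask embedding.
        b, n, _ = tokens.shape
        keep = torch.rand(b, n, device=tokens.device) > self.mask_ratio
        masked = torch.where(keep.unsqueeze(-1), tokens, self.mask_token.expand_as(tokens))
        return masked, keep

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N_img, C) patch features; txt_tokens: (B, N_txt, C) word features
        img_masked, img_keep = self._mask(img_tokens)
        txt_masked, txt_keep = self._mask(txt_tokens)

        # Encode both modalities jointly so each masked token can be recovered
        # from the other modality as well as from its own unmasked neighbours.
        joint = self.joint_encoder(torch.cat([img_masked, txt_masked], dim=1))
        img_rec, txt_rec = joint.split([img_tokens.size(1), txt_tokens.size(1)], dim=1)

        # Reconstruction loss only on the positions that were masked out.
        loss_img = F.mse_loss(self.recon_head(img_rec)[~img_keep], img_tokens[~img_keep])
        loss_txt = F.mse_loss(self.recon_head(txt_rec)[~txt_keep], txt_tokens[~txt_keep])
        return loss_img + loss_txt
```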
- Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions [0.0]
We develop the Multimodal Diffusion Segmentation Model (MDSM), which comprehends a natural language instruction and generates a segmentation mask for the target everyday object.
We build a new dataset based on the well-known Matterport3D and REVERIE datasets.
The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 mean IoU.
arXiv Detail & Related papers (2023-07-17T16:07:07Z)
- MMNet: Multi-Mask Network for Referring Image Segmentation [6.462622145673872]
We propose an end-to-end Multi-Mask Network for referring image segmentation (MMNet).
We first combine image and language features, then employ an attention mechanism to generate multiple queries that represent different aspects of the language expression.
The final result is obtained through the weighted sum of all masks, which greatly reduces the randomness of the language expression; a rough sketch of this weighted-mask idea follows this entry.
arXiv Detail & Related papers (2023-05-24T10:02:27Z)
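To illustrate the multiple-query, weighted-mask idea summarized in the MMNet entry above, here is a small, hypothetical PyTorch sketch: several language-conditioned queries each produce a candidate mask by dot product with the fused feature map, and learned per-query weights combine the candidates into the final prediction. Names, shapes, and the number of queries are illustrative assumptions rather than the MMNet implementation.

```python
import torch
import torch.nn as nn


class MultiMaskHeadSketch(nn.Module):
    """Generate several language-conditioned queries, one mask each, then fuse."""

    def __init__(self, dim: int = 256, num_queries: int = 4):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, fused_feat: torch.Tensor, lang_feat: torch.Tensor, h: int, w: int):
        # fused_feat: (B, H*W, C) combined image-language features
        # lang_feat:  (B, L, C) word-level language features
        b = fused_feat.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)   # (B, Q, C)

        # Each query attends to the words, capturing a different aspect of the expression.
        queries, _ = self.cross_attn(queries, lang_feat, lang_feat)        # (B, Q, C)

        # One candidate mask per query via dot product with the visual tokens.
        masks = torch.einsum("bqc,bnc->bqn", queries, fused_feat)          # (B, Q, H*W)
        masks = masks.view(b, -1, h, w)

        # Learned per-query weights; the final mask is the weighted sum of candidates.
        weights = self.score_head(queries).softmax(dim=1)                  # (B, Q, 1)
        final_mask = (masks * weights.unsqueeze(-1)).sum(dim=1)            # (B, H, W)
        return final_mask.sigmoid()
```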
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment; a rough contrastive-alignment sketch follows this entry.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
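As a generic illustration of the contrastive multi-modal alignment that CLUM is described as enhancing, the hypothetical sketch below pulls pooled features of the referred region toward their paired sentence embedding and pushes them away from the other sentences in the batch (a standard InfoNCE-style formulation used here only for illustration, not PCAN's actual module).

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(region_feat: torch.Tensor,
                               sent_feat: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment between region and sentence embeddings.

    region_feat: (B, C) pooled visual features of the referred regions.
    sent_feat:   (B, C) sentence-level language embeddings, paired by index.
    """
    region = F.normalize(region_feat, dim=-1)
    sent = F.normalize(sent_feat, dim=-1)

    # Similarity of every region to every sentence in the batch.
    logits = region @ sent.t() / temperature            # (B, B)
    targets = torch.arange(region.size(0), device=region.device)

    # Matched pairs sit on the diagonal; treat alignment symmetrically.
    loss_r2s = F.cross_entropy(logits, targets)
    loss_s2r = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_r2s + loss_s2r)
```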
- AISFormer: Amodal Instance Segmentation with Transformer [9.042737643989561]
Amodal Instance Segmentation (AIS) aims to segment the region of both the visible and possibly occluded parts of an object instance.
We present AISFormer, an AIS framework, with a Transformer-based mask head.
arXiv Detail & Related papers (2022-10-12T15:42:40Z)
- MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation [13.311777431243296]
MaIL is a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder.
MaIL unifies uni-modal feature extractors and their fusion model into a deep modality interaction encoder.
For the first time, we propose to introduce instance masks as an additional modality, which explicitly intensifies instance-level features.
arXiv Detail & Related papers (2021-11-21T05:54:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.