MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image
Segmentation
- URL: http://arxiv.org/abs/2111.10747v1
- Date: Sun, 21 Nov 2021 05:54:17 GMT
- Title: MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image
Segmentation
- Authors: Zizhang Li, Mengmeng Wang, Jianbiao Mei, Yong Liu
- Abstract summary: MaIL is a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder.
MaIL unifies uni-modal feature extractors and their fusion model into a deep modality interaction encoder.
For the first time, we propose to introduce instance masks as an additional modality, which explicitly intensifies instance-level features.
- Score: 13.311777431243296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring image segmentation is a typical multi-modal task that aims at
generating a binary mask for the referent described by a given language expression.
Prior arts adopt a bimodal solution, taking images and language as two
modalities within an encoder-fusion-decoder pipeline. However, this pipeline is
sub-optimal for the target task for two reasons. First, it only fuses the
high-level features produced separately by the uni-modal encoders, which hinders
sufficient cross-modal learning. Second, the uni-modal encoders are pre-trained
independently, which brings inconsistency between pre-trained uni-modal tasks
and the target multi-modal task. In addition, this pipeline often ignores or makes
little use of intuitively beneficial instance-level features. To alleviate these
problems, we propose MaIL, a more concise encoder-decoder pipeline
with a Mask-Image-Language trimodal encoder. Specifically, MaIL unifies
uni-modal feature extractors and their fusion model into a deep modality
interaction encoder, facilitating sufficient feature interaction across
different modalities. Meanwhile, MaIL directly avoids the second limitation
since no uni-modal encoders are needed anymore. Moreover, for the first time,
we propose to introduce instance masks as an additional modality, which
explicitly intensifies instance-level features and promotes finer segmentation
results. The proposed MaIL sets a new state-of-the-art on all frequently used
referring image segmentation datasets, including RefCOCO, RefCOCO+, and G-Ref,
with significant gains of 3%-10% over the previous best methods. Code will be
released soon.
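To make the unified trimodal design described above more concrete, here is a minimal PyTorch sketch of the general idea: instance-mask, image, and language tokens are embedded, tagged with a modality-type embedding, and processed jointly by one shared transformer encoder, in place of separate uni-modal encoders plus a fusion module. The class name, dimensions, patch size, and embedding choices are illustrative assumptions, not MaIL's actual implementation; positional embeddings and the segmentation decoder are omitted.

```python
# Minimal sketch (not MaIL's actual code) of a mask-image-language trimodal encoder:
# all three token types are concatenated into one sequence and encoded jointly,
# so cross-modal interaction happens at every layer rather than only after
# separate uni-modal encoding.
import torch
import torch.nn as nn

class TrimodalEncoder(nn.Module):
    def __init__(self, dim=256, depth=6, heads=8, vocab=30522, patch=16):
        super().__init__()
        self.img_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # RGB patches
        self.mask_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # instance-mask patches
        self.txt_embed = nn.Embedding(vocab, dim)                             # word tokens
        self.type_embed = nn.Embedding(3, dim)                                # mask / image / text tags
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, image, inst_masks, token_ids):
        # image: (B, 3, H, W); inst_masks: (B, N, H, W); token_ids: (B, L)
        B, N, H, W = inst_masks.shape
        img_tok = self.img_embed(image).flatten(2).transpose(1, 2)            # (B, P, dim)
        msk_tok = self.mask_embed(inst_masks.reshape(B * N, 1, H, W))
        msk_tok = msk_tok.flatten(2).transpose(1, 2).reshape(B, -1, img_tok.size(-1))  # (B, N*P, dim)
        txt_tok = self.txt_embed(token_ids)                                   # (B, L, dim)
        tokens = torch.cat([                                                  # one joint sequence
            msk_tok + self.type_embed.weight[0],
            img_tok + self.type_embed.weight[1],
            txt_tok + self.type_embed.weight[2],
        ], dim=1)
        return self.encoder(tokens)  # deep interaction across all three modalities

# Usage: 4 candidate instance masks, a 224x224 image, and a 12-token expression.
feats = TrimodalEncoder()(torch.randn(2, 3, 224, 224),
                          torch.rand(2, 4, 224, 224),
                          torch.randint(0, 30522, (2, 12)))
```

In such a design, a decoder head would then map the encoded image tokens, conditioned on the mask and language tokens, back to a binary mask for the referent.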
Related papers
- A Simple Baseline with Single-encoder for Referring Image Segmentation [14.461024566536478]
We present a novel RIS method with a single encoder, i.e., BEiT-3, maximizing the potential of shared self-attention.
Our simple baseline with a single encoder achieves outstanding performance on the RIS benchmark datasets.
arXiv Detail & Related papers (2024-08-28T04:14:01Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, for the alignment of tokens, we utilize the other four adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
- Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation [49.6153714376745]
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression.
We propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities (see the mutual-attention sketch after this list).
arXiv Detail & Related papers (2023-05-24T16:26:05Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots [78.23772771485635]
We introduce a method to segment the visual field into independently moving regions, trained with no ground truth or supervision.
It consists of an adversarial conditional encoder-decoder architecture based on Slot Attention.
arXiv Detail & Related papers (2023-04-04T00:26:13Z)
- PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection [26.03582038710992]
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities.
In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world.
We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
arXiv Detail & Related papers (2023-03-14T17:58:03Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before being segmented.
This may hamper their application in real-world, resource-limited scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- CMF: Cascaded Multi-model Fusion for Referring Image Segmentation [24.942658173937563]
We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression.
We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel (see the parallel atrous-convolution sketch after this list).
Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
arXiv Detail & Related papers (2021-06-16T08:18:39Z)
- VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding [78.28397557433544]
We present a task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks.
Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.
arXiv Detail & Related papers (2021-05-20T19:13:27Z)
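As a companion to the Multi-Modal Mutual Attention entry above (and referenced there), below is a minimal sketch of bidirectional cross-attention between visual and language tokens. It illustrates the general mutual-fusion idea only, not the paper's exact $\mathrm{M^3Att}$ / $\mathrm{M^3Dec}$ modules; the dimensions and the residual/normalization layout are assumptions.

```python
# Generic mutual (bidirectional) cross-attention: vision attends to language
# and language attends to vision, so both token sets are enriched by the other
# modality. Not the M^3Att implementation; an illustrative sketch only.
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, heads, batch_first=True)  # vision queries language
        self.l2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # language queries vision
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, vis, lang):
        # vis: (B, P, dim) image tokens; lang: (B, L, dim) word tokens.
        vis2, _ = self.v2l(query=vis, key=lang, value=lang)   # inject language cues into vision
        lang2, _ = self.l2v(query=lang, key=vis, value=vis)   # inject visual cues into language
        # Residual connections keep the original uni-modal content.
        return self.norm_v(vis + vis2), self.norm_l(lang + lang2)

# Example: fuse 196 image tokens with a 12-word referring expression.
fused_v, fused_l = MutualAttention()(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
```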
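Similarly, for the CMF entry above (and referenced there): a minimal ASPP-style sketch of "multiple atrous convolutional layers in parallel" applied to already-fused vision-language features. It is a generic illustration under assumed channel sizes and dilation rates, not the paper's actual cascaded CMF module.

```python
# Parallel atrous (dilated) convolutions over fused multimodal features:
# several dilation rates capture context at different scales, and a 1x1
# convolution projects the concatenated branches back down. Illustrative only.
import torch
import torch.nn as nn

class ParallelAtrousFusion(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # One 3x3 convolution per dilation rate, all applied to the same input.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, fused):
        # fused: (B, in_ch, H, W) visual features already mixed with language cues.
        outs = [torch.relu(branch(fused)) for branch in self.branches]
        return self.project(torch.cat(outs, dim=1))  # (B, out_ch, H, W)

# Example: refine an 8x-downsampled fused feature map.
refined = ParallelAtrousFusion()(torch.randn(2, 512, 40, 40))
```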