Multimodal Diffusion Segmentation Model for Object Segmentation from
Manipulation Instructions
- URL: http://arxiv.org/abs/2307.08597v1
- Date: Mon, 17 Jul 2023 16:07:07 GMT
- Title: Multimodal Diffusion Segmentation Model for Object Segmentation from
Manipulation Instructions
- Authors: Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka and Komei Sugiura
- Abstract summary: We develop a model that comprehends a natural language instruction and generates a segmentation mask for the target everyday object.
We build a new dataset based on the well-known Matterport3D and REVERIE datasets.
The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 mean IoU.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we aim to develop a model that comprehends a natural language
instruction (e.g., "Go to the living room and get the nearest pillow to the
radio art on the wall") and generates a segmentation mask for the target
everyday object. The task is challenging because it requires (1) the
understanding of the referring expressions for multiple objects in the
instruction, (2) the prediction of the target phrase of the sentence among the
multiple phrases, and (3) the generation of pixel-wise segmentation masks
rather than bounding boxes. Studies have been conducted on language-based
segmentation methods; however, they sometimes mask irrelevant regions for
complex sentences. In this paper, we propose the Multimodal Diffusion
Segmentation Model (MDSM), which generates a mask in the first stage and
refines it in the second stage. We introduce a cross-modal parallel feature
extraction mechanism and extend diffusion probabilistic models to handle
cross-modal features. To validate our model, we built a new dataset based on the
well-known Matterport3D and REVERIE datasets. This dataset consists of
instructions with complex referring expressions accompanied by real indoor
environmental images that feature various target objects, in addition to
pixel-wise segmentation masks. The performance of MDSM surpassed that of the
baseline method by a large margin of +10.13 mean IoU.
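Neither the abstract nor this page includes code, so the following PyTorch sketch only illustrates the two-stage idea described above: a coarse mask generated from fused cross-modal features, then an iterative, diffusion-style refinement. All module names, dimensions, the fusion scheme, and the refinement loop are assumptions for illustration, not the authors' MDSM implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """A guess at 'cross-modal parallel feature extraction': image and text
    are encoded in parallel and fused with cross-attention. Illustrative only."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, HW, C) flattened visual tokens
        # txt_feats: (B, L, C) encoded instruction tokens
        fused, _ = self.attn(img_feats, txt_feats, txt_feats)
        return img_feats + fused  # residual fusion

class MDSMSketch(nn.Module):
    """Stage 1 generates a coarse mask; stage 2 refines it iteratively,
    standing in for the diffusion-based refinement the abstract describes."""
    def __init__(self, dim=256, steps=10):
        super().__init__()
        self.fusion = CrossModalFusion(dim)
        self.coarse_head = nn.Linear(dim, 1)
        self.refiner = nn.Sequential(
            nn.Linear(dim + 1, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.steps = steps

    def forward(self, img_feats, txt_feats):
        fused = self.fusion(img_feats, txt_feats)      # (B, HW, C)
        mask = torch.sigmoid(self.coarse_head(fused))  # stage 1: coarse mask
        for _ in range(self.steps):                    # stage 2: refinement
            residual = self.refiner(torch.cat([fused, mask], dim=-1))
            mask = torch.sigmoid(mask + residual)
        return mask                                    # (B, HW, 1) per-pixel mask

# Example: a 14x14 visual grid and a 12-token instruction, both 256-d.
model = MDSMSketch()
print(model(torch.randn(2, 196, 256), torch.randn(2, 12, 256)).shape)
```

A real DDPM-style refiner would use a learned noise schedule and timestep conditioning; the fixed-step loop here only conveys the generate-then-refine structure.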
Related papers
- 3D-GRES: Generalized 3D Referring Expression Segmentation [77.10044505645064]
3D Referring Expression (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description.
Generalized 3D Referring Expression (3D-GRES) extends the capability to segment any number of instances based on natural language instructions.
arXiv Detail & Related papers (2024-07-30T08:59:05Z)
- Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models [0.8749675983608172]
We consider the task of generating segmentation masks for the target object from an object manipulation instruction.
In this study, we propose a novel method that generates segmentation masks from open-vocabulary instructions.
arXiv Detail & Related papers (2024-07-01T05:48:48Z)
- DFormer: Diffusion-guided Transformer for Universal Image Segmentation [86.73405604947459]
The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model.
At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set.
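As a rough illustration of the denoising-as-inference idea above — predicting masks directly from randomly-generated ones — here is a toy sketch. The denoiser, step count, and all shapes are invented placeholders, not DFormer's transformer design.

```python
import torch
import torch.nn as nn

class ToyMaskDenoiser(nn.Module):
    """Predicts the noise on a set of flattened masks given pooled image
    features. A placeholder for DFormer's diffusion-guided transformer."""
    def __init__(self, hw=196, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hw + dim, 512), nn.ReLU(),
                                 nn.Linear(512, hw))

    def forward(self, noisy_masks, img_feats):
        # noisy_masks: (B, N, HW); img_feats: (B, HW, C)
        ctx = img_feats.mean(dim=1)                      # (B, C) global context
        ctx = ctx.unsqueeze(1).expand(-1, noisy_masks.size(1), -1)
        return self.net(torch.cat([noisy_masks, ctx], dim=-1))

def infer_masks(denoiser, img_feats, n_masks=10, steps=5):
    """Start from random masks and iteratively subtract predicted noise,
    mimicking a reverse diffusion process in the crudest possible way."""
    b, hw, _ = img_feats.shape
    masks = torch.randn(b, n_masks, hw)                  # randomly-generated masks
    for _ in range(steps):
        masks = masks - denoiser(masks, img_feats)
    return masks.sigmoid()                               # per-pixel probabilities

print(infer_masks(ToyMaskDenoiser(), torch.randn(2, 196, 256)).shape)
# torch.Size([2, 10, 196])
```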
arXiv Detail & Related papers (2023-06-06T06:33:32Z)
- Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation [49.6153714376745]
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression.
We propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities.
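One plausible reading of "mutual attention" is symmetric cross-attention, in which each modality queries the other; the sketch below illustrates that reading only and is not the published $\mathrm{M^3Att}$/$\mathrm{M^3Dec}$ architecture.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Symmetric cross-attention: vision attends to language and language
    attends to vision. An illustrative guess, not the paper's module."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, HW, C) visual tokens; lang: (B, L, C) word tokens
        vis_out, _ = self.v2l(vis, lang, lang)
        lang_out, _ = self.l2v(lang, vis, vis)
        return vis + vis_out, lang + lang_out  # residual updates, both directions

fused_vis, fused_lang = MutualAttention()(torch.randn(2, 196, 256),
                                          torch.randn(2, 12, 256))
```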
arXiv Detail & Related papers (2023-05-24T16:26:05Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented.
This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
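The title makes the scheme explicit: first locate the referred object as a coarse position prior, then segment guided by that prior. A minimal sketch under that reading, with invented shapes and heads:

```python
import torch
import torch.nn as nn

class LocateThenSegment(nn.Module):
    """Two explicit steps instead of one implicit fusion: (1) locate the
    referred object as a position prior, (2) segment guided by the prior.
    Module shapes are illustrative assumptions, not the paper's design."""
    def __init__(self, dim=256):
        super().__init__()
        self.locator = nn.Linear(dim, 1)              # step 1: per-pixel relevance
        self.segmenter = nn.Sequential(               # step 2: prior-guided mask
            nn.Linear(dim + 1, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats):
        # feats: (B, HW, C) language-conditioned visual tokens
        prior = torch.softmax(self.locator(feats), dim=1)   # where to look
        mask = self.segmenter(torch.cat([feats, prior], dim=-1))
        return mask.sigmoid()                         # (B, HW, 1)

print(LocateThenSegment()(torch.randn(2, 196, 256)).shape)
```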
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- Multi-task deep learning for image segmentation using recursive approximation tasks [5.735162284272276]
Deep neural networks for segmentation usually require a massive amount of pixel-level labels, which are expensive to create manually.
In this work, we develop a multi-task learning method to relax this constraint.
The network is trained on an extremely small amount of precisely segmented images and a large set of coarse labels.
arXiv Detail & Related papers (2020-05-26T21:35:26Z)
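For the last entry above, a hedged sketch of the mixed-supervision setup: a small, precisely segmented set plus a large coarse set. The simple weighted sum below merely stands in for the paper's recursive approximation tasks, whose exact form is not given here.

```python
import torch
import torch.nn.functional as F

def mixed_supervision_loss(pred_fine, gt_fine, pred_coarse, gt_coarse,
                           coarse_weight=0.3):
    """Combine a loss on the few precisely segmented images with a
    down-weighted loss on the many coarse labels. The weight and the use
    of plain BCE are assumptions, not the paper's formulation."""
    fine = F.binary_cross_entropy_with_logits(pred_fine, gt_fine)
    coarse = F.binary_cross_entropy_with_logits(pred_coarse, gt_coarse)
    return fine + coarse_weight * coarse

# Example: 4 precisely labeled images vs. 64 coarsely labeled ones per batch.
loss = mixed_supervision_loss(
    torch.randn(4, 1, 64, 64), torch.rand(4, 1, 64, 64).round(),
    torch.randn(64, 1, 64, 64), torch.rand(64, 1, 64, 64).round())
```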
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.