Multimodal Diffusion Segmentation Model for Object Segmentation from
Manipulation Instructions
- URL: http://arxiv.org/abs/2307.08597v1
- Date: Mon, 17 Jul 2023 16:07:07 GMT
- Title: Multimodal Diffusion Segmentation Model for Object Segmentation from
Manipulation Instructions
- Authors: Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka and Komei Sugiura
- Abstract summary: We develop a model that comprehends a natural language instruction and generates a segmentation mask for the target everyday object.
We build a new dataset based on the well-known Matterport3D and REVERIE datasets.
The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 mean IoU.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we aim to develop a model that comprehends a natural language
instruction (e.g., "Go to the living room and get the nearest pillow to the
radio art on the wall") and generates a segmentation mask for the target
everyday object. The task is challenging because it requires (1) the
understanding of the referring expressions for multiple objects in the
instruction, (2) the prediction of the target phrase of the sentence among the
multiple phrases, and (3) the generation of pixel-wise segmentation masks
rather than bounding boxes. Studies have been conducted on language-based
segmentation methods; however, they sometimes mask irrelevant regions for
complex sentences. In this paper, we propose the Multimodal Diffusion
Segmentation Model (MDSM), which generates a mask in the first stage and
refines it in the second stage. We introduce a cross-modal parallel feature
extraction mechanism and extend diffusion probabilistic models to handle
cross-modal features. To validate our model, we built a new dataset based on the
well-known Matterport3D and REVERIE datasets. This dataset consists of
instructions with complex referring expressions accompanied by real indoor
environmental images that feature various target objects, in addition to
pixel-wise segmentation masks. The performance of MDSM surpassed that of the
baseline method by a large margin of +10.13 mean IoU.
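Neither the abstract nor this page includes code, so the following PyTorch sketch only illustrates the two-stage idea described above: a coarse mask generated from fused cross-modal features, then an iterative, diffusion-style refinement. All module names, dimensions, the fusion scheme, and the refinement loop are assumptions for illustration, not the authors' MDSM implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """A guess at 'cross-modal parallel feature extraction': image and text
    are encoded in parallel and fused with cross-attention. Illustrative only."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, HW, C) flattened visual tokens
        # txt_feats: (B, L, C) encoded instruction tokens
        fused, _ = self.attn(img_feats, txt_feats, txt_feats)
        return img_feats + fused  # residual fusion

class MDSMSketch(nn.Module):
    """Stage 1 generates a coarse mask; stage 2 refines it iteratively,
    standing in for the diffusion-based refinement the abstract describes."""
    def __init__(self, dim=256, steps=10):
        super().__init__()
        self.fusion = CrossModalFusion(dim)
        self.coarse_head = nn.Linear(dim, 1)
        self.refiner = nn.Sequential(
            nn.Linear(dim + 1, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.steps = steps

    def forward(self, img_feats, txt_feats):
        fused = self.fusion(img_feats, txt_feats)      # (B, HW, C)
        mask = torch.sigmoid(self.coarse_head(fused))  # stage 1: coarse mask
        for _ in range(self.steps):                    # stage 2: refinement
            residual = self.refiner(torch.cat([fused, mask], dim=-1))
            mask = torch.sigmoid(mask + residual)
        return mask                                    # (B, HW, 1) per-pixel mask

# Example: a 14x14 visual grid and a 12-token instruction, both 256-d.
model = MDSMSketch()
print(model(torch.randn(2, 196, 256), torch.randn(2, 12, 256)).shape)
```

A real DDPM-style refiner would use a learned noise schedule and timestep conditioning; the fixed-step loop here only conveys the generate-then-refine structure.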
Related papers
- 3D-GRES: Generalized 3D Referring Expression Segmentation [77.10044505645064]
3D Referring Expression (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description.
Generalized 3D Referring Expression (3D-GRES) extends the capability to segment any number of instances based on natural language instructions.
arXiv Detail & Related papers (2024-07-30T08:59:05Z)
- Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models [0.8749675983608172]
We consider the task of generating segmentation masks for the target object from an object manipulation instruction.
In this study, we propose a novel method that generates segmentation masks from open-vocabulary instructions.
arXiv Detail & Related papers (2024-07-01T05:48:48Z)
- DFormer: Diffusion-guided Transformer for Universal Image Segmentation [86.73405604947459]
The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model.
At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set.
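As a rough illustration of the denoising-as-inference idea above — predicting masks directly from randomly-generated ones — here is a toy sketch. The denoiser, step count, and all shapes are invented placeholders, not DFormer's transformer design.

```python
import torch
import torch.nn as nn

class ToyMaskDenoiser(nn.Module):
    """Predicts the noise on a set of flattened masks given pooled image
    features. A placeholder for DFormer's diffusion-guided transformer."""
    def __init__(self, hw=196, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hw + dim, 512), nn.ReLU(),
                                 nn.Linear(512, hw))

    def forward(self, noisy_masks, img_feats):
        # noisy_masks: (B, N, HW); img_feats: (B, HW, C)
        ctx = img_feats.mean(dim=1)                      # (B, C) global context
        ctx = ctx.unsqueeze(1).expand(-1, noisy_masks.size(1), -1)
        return self.net(torch.cat([noisy_masks, ctx], dim=-1))

def infer_masks(denoiser, img_feats, n_masks=10, steps=5):
    """Start from random masks and iteratively subtract predicted noise,
    mimicking a reverse diffusion process in the crudest possible way."""
    b, hw, _ = img_feats.shape
    masks = torch.randn(b, n_masks, hw)                  # randomly-generated masks
    for _ in range(steps):
        masks = masks - denoiser(masks, img_feats)
    return masks.sigmoid()                               # per-pixel probabilities

print(infer_masks(ToyMaskDenoiser(), torch.randn(2, 196, 256)).shape)
# torch.Size([2, 10, 196])
```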
arXiv Detail & Related papers (2023-06-06T06:33:32Z)
- Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation [49.6153714376745]
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression.
We propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities.
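One plausible reading of "mutual attention" is symmetric cross-attention, in which each modality queries the other; the sketch below illustrates that reading only and is not the published $\mathrm{M^3Att}$/$\mathrm{M^3Dec}$ architecture.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Symmetric cross-attention: vision attends to language and language
    attends to vision. An illustrative guess, not the paper's module."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, HW, C) visual tokens; lang: (B, L, C) word tokens
        vis_out, _ = self.v2l(vis, lang, lang)
        lang_out, _ = self.l2v(lang, vis, vis)
        return vis + vis_out, lang + lang_out  # residual updates, both directions

fused_vis, fused_lang = MutualAttention()(torch.randn(2, 196, 256),
                                          torch.randn(2, 12, 256))
```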
arXiv Detail & Related papers (2023-05-24T16:26:05Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented.
This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
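The title makes the scheme explicit: first locate the referred object as a coarse position prior, then segment guided by that prior. A minimal sketch under that reading, with invented shapes and heads:

```python
import torch
import torch.nn as nn

class LocateThenSegment(nn.Module):
    """Two explicit steps instead of one implicit fusion: (1) locate the
    referred object as a position prior, (2) segment guided by the prior.
    Module shapes are illustrative assumptions, not the paper's design."""
    def __init__(self, dim=256):
        super().__init__()
        self.locator = nn.Linear(dim, 1)              # step 1: per-pixel relevance
        self.segmenter = nn.Sequential(               # step 2: prior-guided mask
            nn.Linear(dim + 1, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats):
        # feats: (B, HW, C) language-conditioned visual tokens
        prior = torch.softmax(self.locator(feats), dim=1)   # where to look
        mask = self.segmenter(torch.cat([feats, prior], dim=-1))
        return mask.sigmoid()                         # (B, HW, 1)

print(LocateThenSegment()(torch.randn(2, 196, 256)).shape)
```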
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- Multi-task deep learning for image segmentation using recursive approximation tasks [5.735162284272276]
Deep neural networks for segmentation usually require a massive amount of pixel-level labels, which are expensive to create manually.
In this work, we develop a multi-task learning method to relax this constraint.
The network is trained on an extremely small amount of precisely segmented images and a large set of coarse labels.
arXiv Detail & Related papers (2020-05-26T21:35:26Z)
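For the last entry above, a hedged sketch of the mixed-supervision setup: a small, precisely segmented set plus a large coarse set. The simple weighted sum below merely stands in for the paper's recursive approximation tasks, whose exact form is not given here.

```python
import torch
import torch.nn.functional as F

def mixed_supervision_loss(pred_fine, gt_fine, pred_coarse, gt_coarse,
                           coarse_weight=0.3):
    """Combine a loss on the few precisely segmented images with a
    down-weighted loss on the many coarse labels. The weight and the use
    of plain BCE are assumptions, not the paper's formulation."""
    fine = F.binary_cross_entropy_with_logits(pred_fine, gt_fine)
    coarse = F.binary_cross_entropy_with_logits(pred_coarse, gt_coarse)
    return fine + coarse_weight * coarse

# Example: 4 precisely labeled images vs. 64 coarsely labeled ones per batch.
loss = mixed_supervision_loss(
    torch.randn(4, 1, 64, 64), torch.rand(4, 1, 64, 64).round(),
    torch.randn(64, 1, 64, 64), torch.rand(64, 1, 64, 64).round())
```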
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.