Related papers: Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

URL: http://arxiv.org/abs/2407.00985v1
Date: Mon, 1 Jul 2024 05:48:48 GMT
Title: Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models
Authors: Takayuki Nishimura, Katsuyuki Kuyo, Motonari Kambara, Komei Sugiura,
Abstract summary: We consider the task of generating segmentation masks for the target object from an object manipulation instruction. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions.
Score: 0.8749675983608172
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera's field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.

Related papers

LlamaSeg: Image Segmentation via Autoregressive Mask Generation [46.17509085054758]
We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions.<n>We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs.
arXiv Detail & Related papers (2025-05-26T02:22:41Z)
Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions [0.0]
We develop a model that comprehends a natural language instruction and generates a segmentation mask for the target everyday object. We build a new dataset based on the well-known Matterport3D and REVERIE datasets. The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 mean IoU.
arXiv Detail & Related papers (2023-07-17T16:07:07Z)
EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation Without Scene Supervision [35.232051353760035]
We introduce Equivariant Neural Field Expectation Maximization (EFEM) to segment objects in 3D scenes without annotations or training on scenes. First, we introduce equivariant shape representations to this problem to eliminate the complexity induced by the variation in object configuration. Second, we propose a novel EM algorithm that can iteratively refine segmentation masks using the equivariant shape prior.
arXiv Detail & Related papers (2023-03-27T17:59:29Z)
Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models [6.408114351192012]
We present a novel method that enables the generation of general foreground-background segmentation models from simple textual descriptions. We show results on the task of segmenting four different objects (humans, dogs, cars, birds) and a use case scenario in medical image analysis.
arXiv Detail & Related papers (2022-12-29T13:51:54Z)
Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation. Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
PointInst3D: Segmenting 3D Instances by Points [136.7261709896713]
We propose a fully-convolutional 3D point cloud instance segmentation method that works in a per-point prediction fashion. We find the key to its success is assigning a suitable target to each sampled point. Our approach achieves promising results on both ScanNet and S3DIS benchmarks.
arXiv Detail & Related papers (2022-04-25T02:41:46Z)
SODAR: Segmenting Objects by DynamicallyAggregating Neighboring Mask Representations [90.8752454643737]
Recent state-of-the-art one-stage instance segmentation model SOLO divides the input image into a grid and directly predicts per grid cell object masks with fully-convolutional networks. We observe SOLO generates similar masks for an object at nearby grid cells, and these neighboring predictions can complement each other as some may better segment certain object part. Motivated by the observed gap, we develop a novel learning-based aggregation method that improves upon SOLO by leveraging the rich neighboring information.
arXiv Detail & Related papers (2022-02-15T13:53:03Z)
Learning Class-Agnostic Pseudo Mask Generation for Box-Supervised Semantic Segmentation [156.9155100983315]
We seek for a more accurate learning-based class-agnostic pseudo mask generator tailored to box-supervised semantic segmentation. Our method can further close the performance gap between box-supervised and fully-supervised models.
arXiv Detail & Related papers (2021-03-09T14:54:54Z)
Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in semi-supervised setting. We propose a novel graph neuralS network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
GridMask Data Augmentation [76.79300104795966]
We propose a novel data augmentation method GridMask' in this paper. It utilizes information removal to achieve state-of-the-art results in a variety of computer vision tasks.
arXiv Detail & Related papers (2020-01-13T07:27:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.