Mean Shift Mask Transformer for Unseen Object Instance Segmentation
- URL: http://arxiv.org/abs/2211.11679v3
- Date: Thu, 21 Sep 2023 23:04:42 GMT
- Title: Mean Shift Mask Transformer for Unseen Object Instance Segmentation
- Authors: Yangxiao Lu, Yuqiao Chen, Nicholas Ruozzi, Yu Xiang
- Abstract summary: Mean Shift Mask Transformer (MSMFormer) is a new transformer architecture that simulates the von Mises-Fisher (vMF) mean shift clustering algorithm.
Our experiments show that MSMFormer achieves competitive performance compared to state-of-the-art methods for unseen object instance segmentation.
- Score: 12.371855276852195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Segmenting unseen objects from images is a critical perception
skill that a robot needs to acquire. In robot manipulation, it enables a robot
to grasp and manipulate unseen objects. Mean shift clustering is a widely used
method for image segmentation tasks. However, the traditional mean shift
clustering algorithm is not differentiable, making it difficult to integrate
into an end-to-end neural network training framework. In this work, we propose
the Mean Shift Mask Transformer (MSMFormer), a new transformer architecture
that simulates the von Mises-Fisher (vMF) mean shift clustering algorithm,
allowing for the joint training and inference of both the feature extractor and
the clustering. Its central component is a hypersphere attention mechanism,
which updates object queries on a hypersphere. To illustrate the effectiveness
of our method, we apply MSMFormer to unseen object instance segmentation. Our
experiments show that MSMFormer achieves competitive performance compared to
state-of-the-art methods for unseen object instance segmentation. The project
page, appendix, video, and code are available at
https://irvlutd.github.io/MSMFormer
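To make the underlying idea concrete, the following is a minimal NumPy sketch of von Mises-Fisher (vMF) mean shift clustering on the unit hypersphere, the algorithm MSMFormer simulates. This is an illustrative reimplementation, not the authors' code: the concentration parameter `kappa`, the iteration count, and the function name are assumptions.

```python
import numpy as np

def vmf_mean_shift(features, kappa=20.0, iters=10):
    """Shift each point toward a local mode on the unit hypersphere.

    features: (n, d) array of feature vectors (normalized internally).
    Returns an (n, d) array of converged modes; points whose modes
    nearly coincide belong to the same cluster (object instance).
    """
    # Project features onto the unit hypersphere.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    modes = x.copy()
    for _ in range(iters):
        # vMF kernel: weight each point by exp(kappa * cosine similarity).
        w = np.exp(kappa * (modes @ x.T))                       # (n, n)
        modes = w @ x                                           # weighted sum
        # Re-project the shifted modes back onto the hypersphere.
        modes /= np.linalg.norm(modes, axis=1, keepdims=True)
    return modes
```

The re-normalization after every update is what keeps the iteration on the hypersphere; the paper's hypersphere attention mechanism plays the analogous role for object queries, with learned attention replacing the fixed vMF kernel.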
Related papers
- LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion [79.22197702626542]
This paper introduces a framework that explores amodal segmentation for robotic grasping in cluttered scenes.
We propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net).
The results on different datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-06T14:50:48Z)
- Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors [30.579707929061026]
Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting.
We tackle the problem by framing it as a dense semantic part correspondence task.
Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object.
arXiv Detail & Related papers (2024-03-21T16:26:19Z)
- HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation [113.6560373226501]
This work studies semantic segmentation under the domain generalization setting.
We propose a novel hierarchical grouping transformer (HGFormer) to explicitly group pixels to form part-level masks and then whole-level masks.
Experiments show that HGFormer yields more robust semantic segmentation results than per-pixel classification methods and flat grouping transformers.
arXiv Detail & Related papers (2023-05-22T13:33:41Z)
- Self-Supervised Instance Segmentation by Grasping [84.2469669256257]
We learn a grasp segmentation model to segment the grasped object from before and after grasp images.
Using the segmented objects, we can "cut" objects from their original scenes and "paste" them into new scenes to generate instance supervision.
We show that our grasp segmentation model provides a 5x error reduction when segmenting grasped objects compared with traditional image subtraction approaches.
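The cut-and-paste supervision described above can be sketched as follows. The array shapes, the offset-based placement, and the hard overwrite blending are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def cut_and_paste(src_img, src_mask, dst_img, offset=(0, 0)):
    """Cut the masked object out of src_img and paste it into dst_img.

    src_img, dst_img: (H, W, 3) uint8 images of the same size.
    src_mask: (H, W) boolean mask of the segmented (grasped) object.
    offset: (dy, dx) translation applied to the object before pasting.
    Returns the composited image and the pasted object's mask, which
    serves as a free instance-segmentation label for the new scene.
    """
    dy, dx = offset
    # Translate both the object pixels and its mask.
    pasted_mask = np.roll(src_mask, shift=(dy, dx), axis=(0, 1))
    obj = np.roll(src_img, shift=(dy, dx), axis=(0, 1))
    # Overwrite destination pixels under the mask.
    out = dst_img.copy()
    out[pasted_mask] = obj[pasted_mask]
    return out, pasted_mask
```

Each composited image comes with an exact mask by construction, which is why pasting segmented objects into new scenes yields instance supervision without manual labels.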
arXiv Detail & Related papers (2023-05-10T16:51:36Z)
- SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation [22.930296667684125]
We propose a new box-supervised instance segmentation approach by developing a Semantic-aware Instance Mask (SIM) generation paradigm.
Considering that the semantic-aware prototypes cannot distinguish different instances of the same semantics, we propose a self-correction mechanism.
Extensive experimental results demonstrate the superiority of our proposed SIM approach over other state-of-the-art methods.
arXiv Detail & Related papers (2023-03-14T05:59:25Z)
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- SOIT: Segmenting Objects with Instance-Aware Transformers [16.234574932216855]
This paper presents an end-to-end instance segmentation framework, termed SOIT, that Segments Objects with Instance-aware Transformers.
Inspired by DETR (Carion et al., 2020), our method views instance segmentation as a direct set prediction problem.
Experimental results on the MS COCO dataset demonstrate that SOIT outperforms state-of-the-art instance segmentation approaches significantly.
arXiv Detail & Related papers (2021-12-21T08:23:22Z)
- RICE: Refining Instance Masks in Cluttered Environments with Graph Neural Networks [53.15260967235835]
We propose a novel framework that refines the output of such methods by utilizing a graph-based representation of instance masks.
We train deep networks capable of sampling smart perturbations to the segmentations, and a graph neural network, which can encode relations between objects, to evaluate the segmentations.
We demonstrate an application that uses uncertainty estimates generated by our method to guide a manipulator, leading to efficient understanding of cluttered scenes.
arXiv Detail & Related papers (2021-06-29T20:29:29Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Fast Object Segmentation Learning with Kernel-based Methods for Robotics [21.48920421574167]
Object segmentation is a key component in the visual system of a robot that performs tasks like grasping and object manipulation.
We propose a novel architecture for object segmentation that overcomes this problem and provides comparable performance in a fraction of the time required by state-of-the-art methods.
Our approach is validated on the YCB-Video dataset which is widely adopted in the computer vision and robotics community.
arXiv Detail & Related papers (2020-11-25T15:07:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.