Masked-attention Mask Transformer for Universal Image Segmentation
- URL: http://arxiv.org/abs/2112.01527v1
- Date: Thu, 2 Dec 2021 18:59:58 GMT
- Title: Masked-attention Mask Transformer for Universal Image Segmentation
- Authors: Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander
Kirillov and Rohit Girdhar
- Abstract summary: We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic)
Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions.
In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets.
- Score: 180.73009259614494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image segmentation is about grouping pixels with different semantics, e.g.,
category or instance membership, where each choice of semantics defines a task.
While only the semantics of each task differ, current research focuses on
designing specialized architectures for each task. We present Masked-attention
Mask Transformer (Mask2Former), a new architecture capable of addressing any
image segmentation task (panoptic, instance or semantic). Its key components
include masked attention, which extracts localized features by constraining
cross-attention within predicted mask regions. In addition to reducing the
research effort by at least three times, it outperforms the best specialized
architectures by a significant margin on four popular datasets. Most notably,
Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on
COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7
mIoU on ADE20K).
Related papers
- The revenge of BiSeNet: Efficient Multi-Task Image Segmentation [6.172605433695617]
BiSeNetFormer is a novel architecture for efficient multi-task image segmentation.
By seamlessly supporting multiple tasks, BiSeNetFormer offers a versatile solution for multi-task segmentation.
Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks.
arXiv Detail & Related papers (2024-04-15T08:32:18Z) - Unsupervised Universal Image Segmentation [59.0383635597103]
We propose an Unsupervised Universal model (U2Seg) adept at performing various image segmentation tasks.
U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models.
We then self-train the model on these pseudo semantic labels, yielding substantial performance gains.
arXiv Detail & Related papers (2023-12-28T18:59:04Z) - Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z) - Discovering Object Masks with Transformers for Unsupervised Semantic
Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z) - Mask DINO: Towards A Unified Transformer-based Framework for Object
Detection and Segmentation [15.826822450977271]
Mask DINO is a unified object detection and segmentation framework.
Mask DINO is simple, efficient, scalable, and benefits from joint large-scale detection and segmentation datasets.
arXiv Detail & Related papers (2022-06-06T17:57:25Z) - AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical one among which lies in foreground-background imbalance.
We propose Adaptive Focus Framework (AF$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$ has significantly improved the accuracy on three widely used aerial benchmarks, as fast as the mainstream method.
arXiv Detail & Related papers (2022-02-18T10:14:45Z) - Few-shot semantic segmentation via mask aggregation [5.886986014593717]
Few-shot semantic segmentation aims to recognize novel classes with only very few labelled data.
Previous works have typically regarded it as a pixel-wise classification problem.
We introduce a mask-based classification method for addressing this problem.
arXiv Detail & Related papers (2022-02-15T07:13:09Z) - Per-Pixel Classification is Not All You Need for Semantic Segmentation [184.2905747595058]
Mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks.
We propose MaskFormer, a simple mask classification model which predicts a set of binary masks.
Our method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
arXiv Detail & Related papers (2021-07-13T17:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.