Per-Pixel Classification is Not All You Need for Semantic Segmentation
- URL: http://arxiv.org/abs/2107.06278v1
- Date: Tue, 13 Jul 2021 17:59:50 GMT
- Title: Per-Pixel Classification is Not All You Need for Semantic Segmentation
- Authors: Bowen Cheng and Alexander G. Schwing and Alexander Kirillov
- Abstract summary: Mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks.
We propose MaskFormer, a simple mask classification model which predicts a set of binary masks.
Our method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
- Score: 184.2905747595058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern approaches typically formulate semantic segmentation as a per-pixel
classification task, while instance-level segmentation is handled with an
alternative mask classification. Our key insight: mask classification is
sufficiently general to solve both semantic- and instance-level segmentation
tasks in a unified manner using the exact same model, loss, and training
procedure. Following this observation, we propose MaskFormer, a simple mask
classification model which predicts a set of binary masks, each associated with
a single global class label prediction. Overall, the proposed mask
classification-based method simplifies the landscape of effective approaches to
semantic and panoptic segmentation tasks and shows excellent empirical results.
In particular, we observe that MaskFormer outperforms per-pixel classification
baselines when the number of classes is large. Our mask classification-based
method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K)
and panoptic segmentation (52.7 PQ on COCO) models.
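To make the mask-classification formulation concrete, below is a minimal sketch of the semantic inference step the abstract describes: each of N predicted binary masks carries one class distribution, and per-pixel class scores are obtained by combining the two. The shapes, query count, and toy inputs are illustrative, not the paper's exact configuration.

```python
import torch

def semantic_inference(class_logits, mask_logits):
    """Combine N mask predictions into a per-pixel semantic map.

    class_logits: (N, C + 1) per-mask class scores; the extra last
                  column is a "no object" class.
    mask_logits:  (N, H, W) binary mask logits, one map per mask.
    Returns:      (H, W) predicted class index for every pixel.
    """
    # Drop the "no object" column, keeping probabilities for the C real classes.
    class_probs = class_logits.softmax(dim=-1)[:, :-1]  # (N, C)
    mask_probs = mask_logits.sigmoid()                  # (N, H, W)
    # Per-pixel score for class c: sum over masks of p(c | mask) * p(pixel in mask).
    pixel_scores = torch.einsum("nc,nhw->chw", class_probs, mask_probs)
    return pixel_scores.argmax(dim=0)                   # (H, W)

# Toy usage: 100 mask predictions, 150 classes (as in ADE20K), a 64x64 output.
seg = semantic_inference(torch.randn(100, 151), torch.randn(100, 64, 64))
print(seg.shape)  # torch.Size([64, 64])
```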
Related papers
- Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation [29.43462426812185]
We propose a paradigm change by shifting from per-pixel classification to mask classification.
Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating anomaly detection into a mask-classification architecture.
Through comprehensive qualitative and quantitative evaluation, we show Mask2Anomaly achieves new state-of-the-art results.
arXiv Detail & Related papers (2023-09-08T20:07:18Z)
- MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach generates fine-grained patch-text pair data by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
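As a rough illustration of the patch-mixing idea, the sketch below shuffles same-position patches across a batch while recording each patch's source image, which is the bookkeeping needed to keep patch-text correspondence supervisable after mixing. The function name and the uniform-shuffle strategy are assumptions for illustration, not MixReorg's actual recipe.

```python
import torch

def mix_patches(images, patch=16):
    """Illustrative patch mixing: shuffle same-position patches across a
    batch and record each patch's source image, so patch-text
    correspondence can still be supervised after mixing."""
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    # Cut every image into a (gh * gw) grid of patches.
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, gh * gw, C, patch, patch)
    # For each grid position, draw an independent permutation of the batch.
    src = torch.stack([torch.randperm(B) for _ in range(gh * gw)], dim=1)  # (B, P)
    mixed = patches[src, torch.arange(gh * gw)]  # gather patches from source images
    # Fold the mixed patches back into full images.
    mixed = mixed.reshape(B, gh, gw, C, patch, patch).permute(0, 3, 1, 4, 2, 5)
    return mixed.reshape(B, C, H, W), src  # src[b, p]: which image patch p came from

mixed, src = mix_patches(torch.randn(4, 3, 64, 64))
print(mixed.shape, src.shape)  # torch.Size([4, 3, 64, 64]) torch.Size([4, 16])
```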
arXiv Detail & Related papers (2023-08-09T09:35:16Z)
- Unmasking Anomalies in Road-Scene Segmentation [18.253109627901566]
Anomaly segmentation is a critical task for driving applications.
We propose a paradigm change by shifting from per-pixel classification to mask classification.
Mask2Anomaly demonstrates the feasibility of integrating an anomaly detection method in a mask-classification architecture.
arXiv Detail & Related papers (2023-07-25T08:23:10Z)
- MaskRange: A Mask-classification Model for Range-view based LiDAR Segmentation [34.04740351544143]
We propose a unified mask-classification model, MaskRange, for the range-view based LiDAR semantic and panoptic segmentation.
Our MaskRange achieves state-of-the-art performance with 66.10 mIoU on semantic segmentation and promising results with 53.10 PQ on panoptic segmentation, with high efficiency.
arXiv Detail & Related papers (2022-06-24T04:39:49Z)
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z)
- Few-shot semantic segmentation via mask aggregation [5.886986014593717]
Few-shot semantic segmentation aims to recognize novel classes from only a few labelled samples.
Previous works have typically regarded it as a pixel-wise classification problem.
We introduce a mask-based classification method for addressing this problem.
arXiv Detail & Related papers (2022-02-15T07:13:09Z)
- Masked-attention Mask Transformer for Universal Image Segmentation [180.73009259614494]
We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic).
Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions.
In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets.
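The masked-attention idea lends itself to a compact sketch: cross-attention logits are blocked wherever the previous layer's mask prediction says a pixel lies outside the query's region. The single-head, unbatched form and the empty-mask fallback below are simplifications of mine, not Mask2Former's full implementation.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(queries, keys, values, mask_probs, thresh=0.5):
    """Single-head sketch: each query attends only to pixels where its
    own predicted mask probability is at least `thresh`.

    queries:     (N, d) object query features.
    keys/values: (P, d) flattened image features (P = H * W pixels).
    mask_probs:  (N, P) mask probabilities from the previous decoder layer.
    """
    d = queries.shape[-1]
    logits = queries @ keys.t() / d ** 0.5                     # (N, P)
    # Block attention to pixels outside each query's predicted mask.
    logits = logits.masked_fill(mask_probs < thresh, float("-inf"))
    # If a query's mask is empty everywhere, fall back to global attention.
    empty = torch.isinf(logits).all(dim=-1, keepdim=True)
    logits = torch.where(empty, torch.zeros_like(logits), logits)
    return F.softmax(logits, dim=-1) @ values                  # (N, d)
```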
arXiv Detail & Related papers (2021-12-02T18:59:58Z)
- Scaling up instance annotation via label propagation [69.8001043244044]
We propose a highly efficient annotation scheme for building large datasets with object segmentation masks.
We exploit similarities between objects by applying hierarchical clustering to mask predictions made by a segmentation model.
We show that we obtain 1M object segmentation masks with a total annotation time of only 290 hours.
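A minimal sketch of the clustering step, assuming each predicted mask comes with a feature embedding: similar masks are grouped so a human need only verify one representative per cluster before its label is propagated to the rest. The embedding source, linkage settings, and threshold are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_masks_for_annotation(mask_embeddings, distance_threshold=0.5):
    """Cluster predicted masks by appearance so one verified label per
    cluster can be propagated to all of its members.
    mask_embeddings: (M, d) feature vector per predicted mask."""
    Z = linkage(mask_embeddings, method="average", metric="cosine")
    clusters = fcluster(Z, t=distance_threshold, criterion="distance")
    # One representative mask per cluster goes to a human annotator.
    reps = {c: int(np.flatnonzero(clusters == c)[0]) for c in np.unique(clusters)}
    return clusters, reps

clusters, reps = group_masks_for_annotation(np.random.rand(200, 64))
print(f"{len(reps)} masks to verify instead of 200")
```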
arXiv Detail & Related papers (2021-10-05T18:29:34Z)
- Investigating and Simplifying Masking-based Saliency Methods for Model Interpretability [5.387323728379395]
Saliency maps that identify the most informative regions of an image are valuable for model interpretability.
A common approach to creating saliency maps involves generating input masks that mask out portions of an image.
We show that a masking model can be trained with as few as 10 examples per class and still generate saliency maps with only a 0.7-point increase in localization error.
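The paper trains a masking model, but the underlying principle is easy to show with a classic occlusion baseline: mask out one region at a time and record how much the target-class score drops. This sketch assumes `model` maps a (1, C, H, W) image to class logits; it illustrates masking-based saliency in general, not the method studied in the paper.

```python
import torch

def occlusion_saliency(model, image, target, patch=16):
    """Zero out one patch at a time; a large drop in the target-class
    score marks the patch as informative.
    image: (1, C, H, W); returns an (H//patch, W//patch) saliency grid."""
    model.eval()
    with torch.no_grad():
        base = model(image)[0, target].item()
        _, _, H, W = image.shape
        saliency = torch.zeros(H // patch, W // patch)
        for i in range(saliency.shape[0]):
            for j in range(saliency.shape[1]):
                masked = image.clone()
                masked[..., i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0
                saliency[i, j] = base - model(masked)[0, target].item()
    return saliency
```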
arXiv Detail & Related papers (2020-10-19T18:00:36Z)