MP-Former: Mask-Piloted Transformer for Image Segmentation
- URL: http://arxiv.org/abs/2303.07336v2
- Date: Wed, 15 Mar 2023 17:30:03 GMT
- Title: MP-Former: Mask-Piloted Transformer for Image Segmentation
- Authors: Hao Zhang, Feng Li, Huaizhe Xu, Shijia Huang, Shilong Liu, Lionel M.
Ni, Lei Zhang
- Abstract summary: Mask2Former suffers from inconsistent mask predictions between decoder layers.
We propose a mask-piloted training approach, which feeds noised ground-truth masks in masked-attention and trains the model to reconstruct the original ones.
- Score: 16.620469868310288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a mask-piloted Transformer which improves masked-attention in
Mask2Former for image segmentation. The improvement is based on our observation
that Mask2Former suffers from inconsistent mask predictions between consecutive
decoder layers, which leads to inconsistent optimization goals and low
utilization of decoder queries. To address this problem, we propose a
mask-piloted training approach, which additionally feeds noised ground-truth
masks in masked-attention and trains the model to reconstruct the original
ones. Compared with the predicted masks used in masked-attention, the
ground-truth masks serve as a pilot and effectively alleviate the negative
impact of inaccurate mask predictions in Mask2Former. Based on this technique,
our MP-Former achieves a remarkable performance improvement on all three image
segmentation tasks (instance, panoptic, and semantic), yielding $+2.3$AP and
$+1.6$mIoU on the Cityscapes instance and semantic segmentation tasks with a
ResNet-50 backbone. Our method also significantly speeds up training,
outperforming Mask2Former with half the number of training epochs on ADE20K
with both ResNet-50 and Swin-L backbones. Moreover, our method introduces only
a small computational overhead during training and no extra computation during
inference. Our code will be released at
\url{https://github.com/IDEA-Research/MP-Former}.
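The mask-piloted training described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact recipe: the point-noise scheme (flipping a fraction of mask pixels), the `flip_ratio` parameter, and both function names are assumptions, and masked attention is shown in its common additive-bias form.

```python
import numpy as np

def noise_gt_mask(gt_mask, flip_ratio=0.2, seed=None):
    """Corrupt a binary ground-truth mask by flipping a random
    fraction of its pixels (a hypothetical noising scheme)."""
    rng = np.random.default_rng(seed)
    noisy = gt_mask.copy()
    flip = rng.random(gt_mask.shape) < flip_ratio
    noisy[flip] = 1 - noisy[flip]
    return noisy

def masked_attention_bias(mask):
    """Turn a binary attention mask into an additive bias for the
    decoder's cross-attention: 0 keeps a location, -inf blocks it."""
    return np.where(mask.reshape(-1) > 0, 0.0, -np.inf)
```

During training, such noised ground-truth masks would be fed into masked-attention alongside the usual predicted masks, and the model supervised to reconstruct the clean masks; nothing changes at inference, which is why no extra inference cost is incurred.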
Related papers
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
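The idea of generating binary masks by filtering random noise can be sketched as follows. This is a rough stand-in for the paper's method: the box filter, the `mask_ratio`/`kernel` parameters, and the thresholding rule are illustrative assumptions, not ColorMAE's actual noise filters.

```python
import numpy as np

def filtered_noise_mask(h, w, mask_ratio=0.75, kernel=5, seed=None):
    """Generate a binary mask by low-pass filtering random noise and
    masking the highest-valued fraction of locations (illustrative)."""
    rng = np.random.default_rng(seed)
    noise = rng.random((h, w))
    # simple box blur as a stand-in for the paper's noise filters
    pad = kernel // 2
    padded = np.pad(noise, pad, mode="wrap")
    smooth = np.zeros_like(noise)
    for dy in range(kernel):
        for dx in range(kernel):
            smooth += padded[dy:dy + h, dx:dx + w]
    smooth /= kernel * kernel
    # keep the top mask_ratio fraction of the smoothed values as masked
    thresh = np.quantile(smooth, 1.0 - mask_ratio)
    return (smooth >= thresh).astype(np.uint8)
```

Filtering before thresholding produces spatially correlated mask patterns rather than independent per-patch masking, which is the key difference from random masking.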
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- DFormer: Diffusion-guided Transformer for Universal Image Segmentation [86.73405604947459]
The proposed DFormer views universal image segmentation task as a denoising process using a diffusion model.
At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val 2017 set.
arXiv Detail & Related papers (2023-06-06T06:33:32Z)
- Mask Transfiner for High-Quality Instance Segmentation [95.74244714914052]
We present Mask Transfiner for high-quality and efficient instance segmentation.
Our approach only processes detected error-prone tree nodes and self-corrects their errors in parallel.
Our code and trained models will be available at http://vis.xyz/pub/transfiner.
arXiv Detail & Related papers (2021-11-26T18:58:22Z)
- Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection [11.390163890611246]
Mask R-CNN is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting.
Multiple instances may exist within one proposal, which makes it difficult for the mask head to distinguish different instances and degrades performance.
We propose instance-aware mask learning in which the mask head learns to predict the shape of the whole instance rather than classify each pixel to text or non-text.
arXiv Detail & Related papers (2021-09-08T04:32:29Z)
- Boosting Masked Face Recognition with Multi-Task ArcFace [0.973681576519524]
Given the global health crisis caused by COVID-19, mouth- and nose-covering masks have become an essential everyday clothing accessory.
This measure has challenged state-of-the-art face recognition models, which were not designed to work with masked faces.
A full training pipeline is presented based on the ArcFace work, with several modifications for the backbone and the loss function.
arXiv Detail & Related papers (2021-04-20T10:12:04Z)
- BoxInst: High-Performance Instance Segmentation with Box Annotations [102.10713189544947]
We present a high-performance method that can achieve mask-level instance segmentation with only bounding-box annotations for training.
Our core idea is to redesign the loss of learning masks in instance segmentation, with no modification to the segmentation network itself.
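One way box annotations can supervise a mask is through a projection term: the x- and y-projections of the predicted soft mask should match the projections of the ground-truth box. The sketch below is a simplified dice loss on max-projections; BoxInst also uses a pairwise color-affinity term, which is omitted here, and the function name is an assumption.

```python
import numpy as np

def projection_loss(pred_mask, box):
    """Dice-style loss between the x/y max-projections of a predicted
    soft mask (H, W) and those of its ground-truth box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    box_mask = np.zeros_like(pred_mask)
    box_mask[y0:y1, x0:x1] = 1.0
    loss = 0.0
    for axis in (0, 1):
        p = pred_mask.max(axis=axis)   # projection of the prediction
        t = box_mask.max(axis=axis)    # projection of the box
        inter = (p * t).sum()
        loss += 1.0 - 2.0 * inter / (p.sum() + t.sum() + 1e-6)
    return loss
```

A mask whose projections fill the box exactly incurs near-zero loss, while the loss never constrains the mask's interior shape; that is what the pairwise affinity term is for in the full method.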
arXiv Detail & Related papers (2020-12-03T22:27:55Z)
- DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation [50.70679435176346]
We propose a new mask representation by applying the discrete cosine transform (DCT) to encode the high-resolution binary grid mask into a compact vector.
Our method, termed DCT-Mask, could be easily integrated into most pixel-based instance segmentation methods.
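The DCT encoding can be sketched with an orthonormal DCT-II matrix: the 2-D transform of an n×n binary mask is truncated to a short coefficient vector, and decoding inverts the transform and thresholds. This is a minimal illustration; the paper orders coefficients low-frequency-first (e.g. zig-zag), whereas raster order is used below for brevity, and the function names are assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] = np.sqrt(1.0 / n)
    return m

def encode_mask(mask, n_coeffs):
    """Encode an n x n binary mask into its first n_coeffs DCT coefficients."""
    n = mask.shape[0]
    d = dct_matrix(n)
    coeffs = d @ mask @ d.T          # 2-D DCT-II
    return coeffs.flatten()[:n_coeffs]

def decode_mask(vec, n):
    """Reconstruct an n x n binary mask from a truncated coefficient vector."""
    flat = np.zeros(n * n)
    flat[:len(vec)] = vec
    coeffs = flat.reshape(n, n)
    d = dct_matrix(n)
    recon = d.T @ coeffs @ d         # inverse transform (d is orthonormal)
    return (recon > 0.5).astype(np.uint8)
```

With all n² coefficients retained, the round trip is exact; truncation trades reconstruction quality for a compact regression target.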
arXiv Detail & Related papers (2020-11-19T15:00:21Z)
- Fully Convolutional Networks for Automatically Generating Image Masks to Train Mask R-CNN [4.901462756978097]
The Mask R-CNN method achieves the best results in object detection to date; however, obtaining the object masks needed for training is very time-consuming and laborious.
This paper proposes a novel method for automatically generating image masks for the state-of-the-art Mask R-CNN deep learning method.
Our proposed method obtains image masks automatically to train Mask R-CNN, achieving very high accuracy with a mean average precision (mAP) above 90% for segmentation.
arXiv Detail & Related papers (2020-03-03T08:09:29Z)
- BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation [103.74690082121079]
In this work, we achieve improved mask prediction by effectively combining instance-level information with lower-level, fine-grained semantic information.
Our main contribution is a blender module which draws inspiration from both top-down and bottom-up instance segmentation approaches.
BlendMask can effectively predict dense per-pixel position-sensitive instance features with very few channels, and learn attention maps for each instance with merely one convolution layer.
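The blender described above can be sketched roughly: K shared position-sensitive bases are weighted by a per-instance attention map and summed into one instance mask. Shapes, the nearest-neighbor upsampling, and the function name below are assumptions for illustration, not the paper's exact module.

```python
import numpy as np

def blend(bases, attn):
    """Blend K shared bases with one instance's attention maps.
    bases: (K, H, W) position-sensitive feature bases
    attn:  (K, h, w) low-resolution per-instance attention maps
    Returns (H, W) instance mask logits."""
    K, H, W = bases.shape
    # nearest-neighbor upsample of the attention maps to the base resolution
    up = np.repeat(np.repeat(attn, H // attn.shape[1], axis=1),
                   W // attn.shape[2], axis=2)
    return (bases * up).sum(axis=0)
```

Because the bases are shared across instances and only the small attention maps are per-instance, the per-instance cost stays low, matching the "very few channels, one convolution layer" claim above.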
arXiv Detail & Related papers (2020-01-02T03:30:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.