MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation
- URL: http://arxiv.org/abs/2312.06052v1
- Date: Mon, 11 Dec 2023 00:52:26 GMT
- Title: MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation
- Authors: Abdullah Rashwan, Jiageng Zhang, Ali Taalimi, Fan Yang, Xingyi Zhou,
Chaochao Yan, Liang-Chieh Chen, Yeqing Li
- Abstract summary: We revisit the pure convolution model and propose a novel panoptic architecture named MaskConver.
MaskConver fully unifies the representation of things and stuff by predicting their centers.
We introduce a powerful ConvNeXt-UNet decoder that closes the performance gap between convolution- and transformer-based models.
- Score: 17.627376199097185
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, transformer-based models have dominated panoptic
segmentation, thanks to their strong modeling capabilities and their unified
representation for both semantic and instance classes as global binary masks.
In this paper, we revisit the pure convolution model and propose a novel panoptic
architecture named MaskConver. MaskConver fully unifies the representation of things
and stuff by predicting their centers. To that end, it creates a
lightweight class embedding module that can break the ties when multiple
centers co-exist in the same location. Furthermore, our study shows that the
decoder design is critical in ensuring that the model has sufficient context
for accurate detection and segmentation. We introduce a powerful ConvNeXt-UNet
decoder that closes the performance gap between convolution- and
transformer-based models. With a ResNet50 backbone, our MaskConver achieves 53.6%
PQ on the COCO panoptic val set, outperforming the modern convolution-based
model, Panoptic FCN, by 9.3% as well as transformer-based models such as
Mask2Former (+1.7% PQ) and kMaX-DeepLab (+0.6% PQ). Additionally, MaskConver
with a MobileNet backbone reaches 37.2% PQ, improving over Panoptic-DeepLab by
+6.4% under the same FLOPs/latency constraints. A further optimized version of
MaskConver achieves 29.7% PQ, while running in real-time on mobile devices. The
code and model weights will be publicly available.
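As a rough illustration of the center-based unification described in the abstract, the sketch below pairs a per-class center heatmap with a lightweight class embedding that disambiguates centers sharing a location; the module names, shapes, and top-k decoding are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CenterMaskHead(nn.Module):
    """Illustrative center-based mask head (not the MaskConver code).

    Predicts a per-class center heatmap and an embedding per center;
    masks come from dotting center embeddings with per-pixel embeddings,
    so things and stuff share one representation.
    """

    def __init__(self, in_ch=256, num_classes=133, embed_dim=128):
        super().__init__()
        self.heatmap = nn.Conv2d(in_ch, num_classes, 3, padding=1)      # center heatmap per class
        self.center_embed = nn.Conv2d(in_ch, embed_dim, 3, padding=1)   # per-location embedding
        # class embedding lets two centers at the same pixel but with
        # different classes yield different mask embeddings (tie breaking)
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        self.pixel_embed = nn.Conv2d(in_ch, embed_dim, 3, padding=1)

    def forward(self, feats, topk=100):
        B, _, H, W = feats.shape
        heat = torch.sigmoid(self.heatmap(feats))                       # (B, C, H, W)
        scores, idx = heat.flatten(1).topk(topk, dim=1)                 # top-k over class*H*W
        cls = torch.div(idx, H * W, rounding_mode="floor")              # class of each center
        loc = idx % (H * W)                                             # spatial index of each center
        loc_embed = self.center_embed(feats).flatten(2)                 # (B, D, H*W)
        loc_embed = loc_embed.gather(2, loc.unsqueeze(1).expand(-1, loc_embed.size(1), -1))
        queries = loc_embed.transpose(1, 2) + self.class_embed(cls)     # (B, K, D)
        pixels = self.pixel_embed(feats)                                # (B, D, H, W)
        masks = torch.einsum("bkd,bdhw->bkhw", queries, pixels)         # mask logits per center
        return scores, cls, masks

# toy usage
head = CenterMaskHead()
scores, classes, masks = head(torch.randn(2, 256, 64, 64))
print(masks.shape)  # torch.Size([2, 100, 64, 64])
```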
Related papers
- Pre-training Point Cloud Compact Model with Partial-aware Reconstruction [51.403810709250024]
We present a pre-trained Point cloud Compact Model with Partial-aware Reconstruction, named Point-CPR.
Our model exhibits strong performance across various tasks, especially surpassing the leading MPM-based model PointGPT-B with only 2% of its parameters.
arXiv Detail & Related papers (2024-07-12T15:18:14Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to per-frame Mask2Former, with only up to 2% mIoU degradation on the Cityscapes validation set.
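A generic sketch of the mask-propagation idea, assuming the heavy segmenter runs on key frames only and other frames copy masks via a lightweight feature-affinity step; the affinity scheme and temperature below are illustrative, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def propagate_masks(key_masks, key_feat, cur_feat, temperature=0.07):
    """Propagate key-frame masks to the current frame via feature affinity.

    key_masks: (N, H, W) soft masks predicted on the key frame
    key_feat, cur_feat: (C, H, W) lightweight per-frame features (same size)
    """
    key = F.normalize(key_feat.flatten(1), dim=0)                 # (C, HW)
    cur = F.normalize(cur_feat.flatten(1), dim=0)                 # (C, HW)
    affinity = torch.softmax(cur.t() @ key / temperature, dim=1)  # (HW_cur, HW_key)
    cur_masks = affinity @ key_masks.flatten(1).t()               # (HW_cur, N)
    return cur_masks.t().reshape(key_masks.shape)

# toy usage
masks = propagate_masks(torch.rand(5, 32, 32), torch.randn(64, 32, 32), torch.randn(64, 32, 32))
print(masks.shape)  # torch.Size([5, 32, 32])
```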
arXiv Detail & Related papers (2023-10-29T09:55:28Z) - You Only Segment Once: Towards Real-Time Panoptic Segmentation [68.91492389185744]
YOSO is a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps.
YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K.
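For 1x1 kernels, the dynamic convolution between panoptic kernels and the feature map reduces to a per-segment dot product; the shapes below are illustrative of that step only.

```python
import torch

def dynamic_conv_masks(kernels, features):
    """Dynamic 1x1 convolution: one predicted kernel per segment applied to
    a shared feature map to produce its mask logits (illustrative shapes).

    kernels:  (B, N, C)     one kernel per predicted segment
    features: (B, C, H, W)  shared image feature map
    returns:  (B, N, H, W)  mask logits
    """
    return torch.einsum("bnc,bchw->bnhw", kernels, features)

masks = dynamic_conv_masks(torch.randn(1, 100, 256), torch.randn(1, 256, 64, 64))
print(masks.shape)  # torch.Size([1, 100, 64, 64])
```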
arXiv Detail & Related papers (2023-03-26T07:55:35Z) - Designing BERT for Convolutional Networks: Sparse and Hierarchical
Masked Modeling [23.164631160130092]
We extend the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets).
We treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode.
This is the first use of sparse convolution for 2D masked modeling.
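A dense emulation of the sparse-encoding idea: masked locations are zeroed and the mask is re-applied after every stage so no information leaks out of the visible pixels; the per-pixel mask and the two toy stages are assumptions, not the paper's sparse-convolution encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_conv_encode(img, mask, stages):
    """Encode only the visible pixels of a masked image with conv stages.

    img:    (B, 3, H, W) input image
    mask:   (B, 1, H, W) 1 = visible, 0 = masked
    stages: list of stride-2 convolutions
    """
    x = img * mask
    for conv in stages:
        x = F.gelu(conv(x))
        mask = F.max_pool2d(mask, kernel_size=2)   # downsample mask with the features
        x = x * mask                               # keep masked locations empty
    return x, mask

stages = [nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.Conv2d(32, 64, 3, stride=2, padding=1)]
visible = (torch.rand(1, 1, 64, 64) > 0.6).float()
feats, m = masked_conv_encode(torch.randn(1, 3, 64, 64), visible, stages)
print(feats.shape, m.shape)  # torch.Size([1, 64, 16, 16]) torch.Size([1, 1, 16, 16])
```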
arXiv Detail & Related papers (2023-01-09T18:59:50Z) - ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
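A sketch of a Global Response Normalization layer following the formulation described in the paper (global L2 aggregation per channel, divisive normalization across channels, learnable calibration with a residual), written here for channels-last tensors.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization for channels-last inputs (B, H, W, C)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)     # global L2 per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)  # normalize across channels
        return self.gamma * (x * nx) + self.beta + x          # calibrate + residual

x = torch.randn(2, 14, 14, 96)
print(GRN(96)(x).shape)  # torch.Size([2, 14, 14, 96])
```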
arXiv Detail & Related papers (2023-01-02T18:59:31Z) - kMaX-DeepLab: k-means Mask Transformer [41.104116145904825]
Most existing transformer-based vision models simply borrow the idea from NLP.
Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer for segmentation tasks.
Our kMaX-DeepLab achieves a new state-of-the-art performance on COCO val set with 58.0% PQ, Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU.
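A minimal sketch of the k-means view of cross-attention: each pixel is hard-assigned to a cluster query (argmax, the k-means assignment step) and each cluster is then updated from its assigned pixels; this illustrates the clustering idea, not the exact kMaX layer.

```python
import torch
import torch.nn.functional as F

def kmeans_cross_attention(queries, pixel_feats):
    """One k-means-style assignment/update step between cluster queries and pixels.

    queries:     (B, N, C) cluster centers
    pixel_feats: (B, P, C) flattened pixel features
    """
    logits = torch.einsum("bnc,bpc->bnp", queries, pixel_feats)        # cluster-pixel affinities
    assign = F.one_hot(logits.argmax(dim=1), logits.size(1)).float()   # (B, P, N) hard assignment
    update = torch.einsum("bpn,bpc->bnc", assign, pixel_feats)         # sum of assigned pixels
    counts = assign.sum(dim=1, keepdim=True).transpose(1, 2).clamp(min=1)
    return queries + update / counts                                   # residual cluster update

q = kmeans_cross_attention(torch.randn(1, 8, 64), torch.randn(1, 1024, 64))
print(q.shape)  # torch.Size([1, 8, 64])
```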
arXiv Detail & Related papers (2022-07-08T17:59:01Z) - ConvMAE: Masked Convolution Meets Masked Autoencoders [65.15953258300958]
Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT.
Our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme.
Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base.
arXiv Detail & Related papers (2022-05-08T15:12:19Z) - Mask Transfiner for High-Quality Instance Segmentation [95.74244714914052]
We present Mask Transfiner for high-quality and efficient instance segmentation.
Our approach only processes detected error-prone tree nodes and self-corrects their errors in parallel.
Our code and trained models will be available at http://vis.xyz/pub/transfiner.
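A toy sketch of refining only error-prone points: locations whose coarse probability is near 0.5 are treated as error-prone, gathered, corrected in parallel by a small module, and written back; the uncertainty criterion and the refiner are assumptions, not the paper's quadtree scheme.

```python
import torch
import torch.nn as nn

def refine_error_prone(coarse_logits, feats, refiner, thresh=0.2):
    """Correct only the uncertain points of a coarse mask, in parallel.

    coarse_logits: (B, 1, H, W) coarse mask logits
    feats:         (B, C, H, W) per-pixel features
    refiner:       small module mapping point features to corrected logits
    """
    prob = coarse_logits.sigmoid()
    uncertain = (prob - 0.5).abs() < thresh                        # near-0.5 = error-prone
    refined = coarse_logits.clone()
    if uncertain.any():
        pts = feats.permute(0, 2, 3, 1)[uncertain.squeeze(1)]      # (P, C) gather selected points
        refined[uncertain] = refiner(pts).squeeze(-1)              # correct them in parallel
    return refined

refiner = nn.Linear(64, 1)
out = refine_error_prone(torch.randn(1, 1, 32, 32), torch.randn(1, 64, 32, 32), refiner)
print(out.shape)  # torch.Size([1, 1, 32, 32])
```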
arXiv Detail & Related papers (2021-11-26T18:58:22Z)