kMaX-DeepLab: k-means Mask Transformer
- URL: http://arxiv.org/abs/2207.04044v5
- Date: Mon, 10 Jul 2023 20:59:46 GMT
- Title: kMaX-DeepLab: k-means Mask Transformer
- Authors: Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu,
Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- Abstract summary: Most existing transformer-based vision models simply borrow the idea from NLP.
Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer for segmentation tasks.
Our kMaX-DeepLab achieves a new state-of-the-art performance on COCO val set with 58.0% PQ, Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU.
- Score: 41.104116145904825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of transformers in vision tasks not only advances network backbone
designs, but also starts a brand-new page to achieve end-to-end image
recognition (e.g., object detection and panoptic segmentation). Originating from
Natural Language Processing (NLP), transformer architectures, consisting of
self-attention and cross-attention, effectively learn long-range interactions
between elements in a sequence. However, we observe that most existing
transformer-based vision models simply borrow the idea from NLP, neglecting the
crucial difference between languages and images, particularly the extremely
large sequence length of spatially flattened pixel features. This subsequently
impedes the learning in cross-attention between pixel features and object
queries. In this paper, we rethink the relationship between pixels and object
queries and propose to reformulate the cross-attention learning as a clustering
process. Inspired by the traditional k-means clustering algorithm, we develop a
k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only
improves the state-of-the-art, but also enjoys a simple and elegant design. As
a result, our kMaX-DeepLab achieves a new state-of-the-art performance on COCO
val set with 58.0% PQ, Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5%
mIoU, and ADE20K val set with 50.9% PQ and 55.2% mIoU without test-time
augmentation or external dataset. We hope our work can shed some light on
designing transformers tailored for vision tasks. TensorFlow code and models
are available at https://github.com/google-research/deeplab2. A PyTorch
re-implementation is also available at
https://github.com/bytedance/kmax-deeplab
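The reformulation described in the abstract can be made concrete: instead of object queries attending to all N = H x W pixels through a softmax over the long spatial axis, the queries act as cluster centers, each pixel is assigned to its closest center, and the centers are updated from their assigned pixels, mirroring the assignment and update steps of k-means. Below is a minimal PyTorch-style sketch of that idea only; the function and argument names are illustrative, and the learned query/key/value projections, residual paths, and normalization used in the released deeplab2 / kmax-deeplab code are omitted.

```python
import torch
import torch.nn.functional as F


def kmeans_cross_attention(cluster_centers, pixel_features, hard_assignment=True):
    """Illustrative sketch of cluster-wise (k-means-style) cross-attention.

    cluster_centers: (B, K, D) object queries acting as cluster centers.
    pixel_features:  (B, N, D) flattened pixel features, N = H * W.
    Returns updated cluster centers of shape (B, K, D).
    """
    # Affinity between every cluster center and every pixel: (B, K, N).
    affinity = torch.einsum('bkd,bnd->bkn', cluster_centers, pixel_features)

    if hard_assignment:
        # k-means-style assignment step: each pixel goes to its closest cluster
        # (argmax over the K cluster centers), giving a one-hot attention map
        # rather than a softmax over the very long spatial axis N.
        assignment = F.one_hot(affinity.argmax(dim=1), num_classes=affinity.shape[1])
        assignment = assignment.permute(0, 2, 1).to(affinity.dtype)  # (B, K, N)
    else:
        # Conventional cross-attention would instead normalize over the spatial axis.
        assignment = affinity.softmax(dim=-1)

    # k-means-style update step: aggregate the pixels assigned to each cluster.
    update = torch.einsum('bkn,bnd->bkd', assignment, pixel_features)
    return cluster_centers + update
```

In this simplified view the only change relative to conventional cross-attention is where and how the affinities are normalized: a softmax over N pixels becomes a hard one-hot assignment over the K cluster centers, which stays well behaved even when N is very large.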
Related papers
- MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation [17.627376199097185]
We revisit the pure convolution model and propose a novel panoptic architecture named MaskConver.
MaskConver proposes to fully unify the representation of things and stuff by predicting their centers.
We introduce a powerful ConvNeXt-UNet decoder that closes the performance gap between convolution- and transformer-based models.
arXiv Detail & Related papers (2023-12-11T00:52:26Z)
- T-former: An Efficient Transformer for Image Inpainting [50.43302925662507]
A class of attention-based network architectures, called transformers, has shown strong performance in natural language processing.
In this paper, we design a novel attention mechanism whose cost is linearly related to the resolution, derived from a Taylor expansion (a generic linearization is sketched below), and based on this attention we build a network called $T$-former for image inpainting.
Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and computational complexity.
arXiv Detail & Related papers (2023-05-12T04:10:42Z)
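One generic way to obtain attention whose cost grows linearly with the resolution is to expand the softmax kernel to first order, exp(q.k) ~= 1 + q.k, so that key/value statistics can be pre-summed once and reused for every query. The sketch below illustrates that linearization only; it is not T-former's exact formulation, the names are illustrative, and q and k are assumed L2-normalized so the approximated weights stay non-negative.

```python
import torch


def taylor_linear_attention(q, k, v):
    """Softmax-free attention with cost linear in the token count N, using the
    first-order Taylor approximation exp(q.k) ~= 1 + q.k. Generic sketch only,
    not T-former's exact formulation.

    q, k, v: (B, N, D); q and k are assumed L2-normalized so 1 + q.k >= 0.
    """
    n = k.shape[1]

    # Key/value summaries are computed once and shared by every query: O(N * D^2).
    kv = torch.einsum('bnd,bne->bde', k, v)   # sum_j k_j v_j^T, shape (B, D, D)
    k_sum = k.sum(dim=1)                      # sum_j k_j,       shape (B, D)
    v_sum = v.sum(dim=1)                      # sum_j v_j,       shape (B, D)

    # Numerator:   sum_j (1 + q_i . k_j) v_j = v_sum + q_i @ kv
    numer = v_sum.unsqueeze(1) + torch.einsum('bnd,bde->bne', q, kv)
    # Denominator: sum_j (1 + q_i . k_j)     = N + q_i . k_sum
    denom = n + torch.einsum('bnd,bd->bn', q, k_sum)
    return numer / denom.unsqueeze(-1)        # (B, N, D)
```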
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
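The clustering idea behind the previous entry can be sketched as follows: group the keys and values into a small number of clusters (here with a few plain k-means steps) and let every query attend to the C cluster summaries instead of all N tokens, reducing the attention cost from O(N^2) to roughly O(N*C). This is an illustrative approximation, not ClusTR's exact procedure; the names and the clustering routine are assumptions.

```python
import torch
import torch.nn.functional as F


def clustered_attention(q, k, v, num_clusters=64, iters=5):
    """Content-based sparse attention sketch: cluster keys/values with a few
    k-means steps, then attend over the cluster summaries. Illustrative only,
    not ClusTR's exact procedure.

    q, k, v: (B, N, D). Returns (B, N, D).
    """
    _, N, D = k.shape
    # Initialize centroids from a random subset of the keys.
    centroids = k[:, torch.randperm(N)[:num_clusters], :].clone()      # (B, C, D)

    for _ in range(iters):
        # Assignment step: nearest centroid per key (hard, non-differentiable here).
        assign = torch.cdist(k, centroids).argmin(dim=-1)               # (B, N)
        one_hot = F.one_hot(assign, num_clusters).to(k.dtype)           # (B, N, C)
        counts = one_hot.sum(dim=1).clamp(min=1).unsqueeze(-1)          # (B, C, 1)
        # Update step: mean of the keys assigned to each cluster.
        centroids = torch.einsum('bnc,bnd->bcd', one_hot, k) / counts

    # Aggregate the values with the same assignment.
    v_clustered = torch.einsum('bnc,bnd->bcd', one_hot, v) / counts     # (B, C, D)

    # Dense attention over C cluster tokens instead of N tokens: O(N * C).
    attn = torch.softmax(q @ centroids.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ v_clustered                                           # (B, N, D)
```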
- TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical Image Segmentation [11.190117191084175]
This paper proposes TransDeepLab, a novel DeepLab-like pure Transformer for medical image segmentation.
We exploit a hierarchical Swin Transformer with shifted windows to extend DeepLabv3 and model the Atrous Spatial Pyramid Pooling (ASPP) module.
Our approach performs better than, or on par with, most contemporary works that amalgamate Vision Transformer and CNN-based methods.
arXiv Detail & Related papers (2022-08-01T09:53:53Z)
- MlTr: Multi-label Classification with Transformer [35.14232810099418]
We propose a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention.
The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE.
arXiv Detail & Related papers (2021-06-11T06:53:09Z)
- Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning [60.75687261314962]
We introduce pixel-level pretext tasks for learning dense feature representations.
A pixel-to-propagation consistency task produces better results than state-of-the-art approaches.
Results demonstrate the strong potential of defining pretext tasks at the pixel level.
arXiv Detail & Related papers (2020-11-19T18:59:45Z)
- Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks [70.0243910593064]
Key to the success of vector quantization is deciding which parameter groups should be compressed together.
In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function.
We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
arXiv Detail & Related papers (2020-10-29T15:47:26Z)
- Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition [98.10703825716142]
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
We present different architectures based on PyConv for four main tasks on visual recognition: image classification, video action classification/recognition, object detection and semantic image segmentation/parsing.
arXiv Detail & Related papers (2020-06-20T10:19:29Z)
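The pyramidal convolution in the last entry can be illustrated with a layer that runs several parallel convolutions with different kernel sizes and concatenates their outputs, so a single layer sees the input at multiple receptive-field scales. This is a simplified sketch: the published PyConv additionally uses grouped convolutions per branch to keep the cost comparable to a standard convolution, and the class and argument names here are illustrative.

```python
import torch
import torch.nn as nn


class PyConv2d(nn.Module):
    """Pyramidal convolution sketch: parallel convolutions with different kernel
    sizes whose outputs are concatenated, so one layer covers several spatial
    scales. Simplified relative to the published PyConv (no grouped convs)."""

    def __init__(self, in_channels, out_channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert out_channels % len(kernel_sizes) == 0
        branch_channels = out_channels // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # Each branch sees the same input at a different receptive-field size.
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```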