kMaX-DeepLab: k-means Mask Transformer
- URL: http://arxiv.org/abs/2207.04044v5
- Date: Mon, 10 Jul 2023 20:59:46 GMT
- Title: kMaX-DeepLab: k-means Mask Transformer
- Authors: Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu,
Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- Abstract summary: Most existing transformer-based vision models simply borrow the idea from NLP.
Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer for segmentation tasks.
Our kMaX-DeepLab achieves a new state-of-the-art performance on COCO val set with 58.0% PQ, Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU.
- Score: 41.104116145904825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of transformers in vision tasks not only advances network backbone
designs, but also starts a brand-new page to achieve end-to-end image
recognition (e.g., object detection and panoptic segmentation). Originating from
Natural Language Processing (NLP), transformer architectures, consisting of
self-attention and cross-attention, effectively learn long-range interactions
between elements in a sequence. However, we observe that most existing
transformer-based vision models simply borrow the idea from NLP, neglecting the
crucial difference between languages and images, particularly the extremely
large sequence length of spatially flattened pixel features. This subsequently
impedes the learning in cross-attention between pixel features and object
queries. In this paper, we rethink the relationship between pixels and object
queries and propose to reformulate the cross-attention learning as a clustering
process. Inspired by the traditional k-means clustering algorithm, we develop a
k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only
improves the state-of-the-art, but also enjoys a simple and elegant design. As
a result, our kMaX-DeepLab achieves a new state-of-the-art performance on COCO
val set with 58.0% PQ, Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5%
mIoU, and ADE20K val set with 50.9% PQ and 55.2% mIoU without test-time
augmentation or external dataset. We hope our work can shed some light on
designing transformers tailored for vision tasks. TensorFlow code and models
are available at https://github.com/google-research/deeplab2. A PyTorch
re-implementation is also available at
https://github.com/bytedance/kmax-deeplab
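The reformulation described in the abstract can be made concrete: instead of object queries attending to all N = H x W pixels through a softmax over the long spatial axis, the queries act as cluster centers, each pixel is assigned to its closest center, and the centers are updated from their assigned pixels, mirroring the assignment and update steps of k-means. Below is a minimal PyTorch-style sketch of that idea only; the function and argument names are illustrative, and the learned query/key/value projections, residual paths, and normalization used in the released deeplab2 / kmax-deeplab code are omitted.

```python
import torch
import torch.nn.functional as F


def kmeans_cross_attention(cluster_centers, pixel_features, hard_assignment=True):
    """Illustrative sketch of cluster-wise (k-means-style) cross-attention.

    cluster_centers: (B, K, D) object queries acting as cluster centers.
    pixel_features:  (B, N, D) flattened pixel features, N = H * W.
    Returns updated cluster centers of shape (B, K, D).
    """
    # Affinity between every cluster center and every pixel: (B, K, N).
    affinity = torch.einsum('bkd,bnd->bkn', cluster_centers, pixel_features)

    if hard_assignment:
        # k-means-style assignment step: each pixel goes to its closest cluster
        # (argmax over the K cluster centers), giving a one-hot attention map
        # rather than a softmax over the very long spatial axis N.
        assignment = F.one_hot(affinity.argmax(dim=1), num_classes=affinity.shape[1])
        assignment = assignment.permute(0, 2, 1).to(affinity.dtype)  # (B, K, N)
    else:
        # Conventional cross-attention would instead normalize over the spatial axis.
        assignment = affinity.softmax(dim=-1)

    # k-means-style update step: aggregate the pixels assigned to each cluster.
    update = torch.einsum('bkn,bnd->bkd', assignment, pixel_features)
    return cluster_centers + update
```

In this simplified view the only change relative to conventional cross-attention is where and how the affinities are normalized: a softmax over N pixels becomes a hard one-hot assignment over the K cluster centers, which stays well behaved even when N is very large.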
Related papers
- MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation [17.627376199097185]
We revisit the pure convolution model and propose a novel panoptic architecture named MaskConver.
MaskConver proposes to fully unify the representation of things and stuff by predicting their centers.
We introduce a powerful ConvNeXt-UNet decoder that closes the performance gap between convolution- and transformer-based models.
arXiv Detail & Related papers (2023-12-11T00:52:26Z)
- T-former: An Efficient Transformer for Image Inpainting [50.43302925662507]
A class of attention-based network architectures, called transformers, has shown strong performance in natural language processing.
In this paper, we design a novel attention mechanism whose cost is linearly related to the resolution, derived from a Taylor expansion (a generic linearization is sketched below), and based on this attention we build a network called $T$-former for image inpainting.
Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and computational complexity.
arXiv Detail & Related papers (2023-05-12T04:10:42Z)
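One generic way to obtain attention whose cost grows linearly with the resolution is to expand the softmax kernel to first order, exp(q.k) ~= 1 + q.k, so that key/value statistics can be pre-summed once and reused for every query. The sketch below illustrates that linearization only; it is not T-former's exact formulation, the names are illustrative, and q and k are assumed L2-normalized so the approximated weights stay non-negative.

```python
import torch


def taylor_linear_attention(q, k, v):
    """Softmax-free attention with cost linear in the token count N, using the
    first-order Taylor approximation exp(q.k) ~= 1 + q.k. Generic sketch only,
    not T-former's exact formulation.

    q, k, v: (B, N, D); q and k are assumed L2-normalized so 1 + q.k >= 0.
    """
    n = k.shape[1]

    # Key/value summaries are computed once and shared by every query: O(N * D^2).
    kv = torch.einsum('bnd,bne->bde', k, v)   # sum_j k_j v_j^T, shape (B, D, D)
    k_sum = k.sum(dim=1)                      # sum_j k_j,       shape (B, D)
    v_sum = v.sum(dim=1)                      # sum_j v_j,       shape (B, D)

    # Numerator:   sum_j (1 + q_i . k_j) v_j = v_sum + q_i @ kv
    numer = v_sum.unsqueeze(1) + torch.einsum('bnd,bde->bne', q, kv)
    # Denominator: sum_j (1 + q_i . k_j)     = N + q_i . k_sum
    denom = n + torch.einsum('bnd,bd->bn', q, k_sum)
    return numer / denom.unsqueeze(-1)        # (B, N, D)
```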
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
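The clustering idea behind the previous entry can be sketched as follows: group the keys and values into a small number of clusters (here with a few plain k-means steps) and let every query attend to the C cluster summaries instead of all N tokens, reducing the attention cost from O(N^2) to roughly O(N*C). This is an illustrative approximation, not ClusTR's exact procedure; the names and the clustering routine are assumptions.

```python
import torch
import torch.nn.functional as F


def clustered_attention(q, k, v, num_clusters=64, iters=5):
    """Content-based sparse attention sketch: cluster keys/values with a few
    k-means steps, then attend over the cluster summaries. Illustrative only,
    not ClusTR's exact procedure.

    q, k, v: (B, N, D). Returns (B, N, D).
    """
    _, N, D = k.shape
    # Initialize centroids from a random subset of the keys.
    centroids = k[:, torch.randperm(N)[:num_clusters], :].clone()      # (B, C, D)

    for _ in range(iters):
        # Assignment step: nearest centroid per key (hard, non-differentiable here).
        assign = torch.cdist(k, centroids).argmin(dim=-1)               # (B, N)
        one_hot = F.one_hot(assign, num_clusters).to(k.dtype)           # (B, N, C)
        counts = one_hot.sum(dim=1).clamp(min=1).unsqueeze(-1)          # (B, C, 1)
        # Update step: mean of the keys assigned to each cluster.
        centroids = torch.einsum('bnc,bnd->bcd', one_hot, k) / counts

    # Aggregate the values with the same assignment.
    v_clustered = torch.einsum('bnc,bnd->bcd', one_hot, v) / counts     # (B, C, D)

    # Dense attention over C cluster tokens instead of N tokens: O(N * C).
    attn = torch.softmax(q @ centroids.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ v_clustered                                           # (B, N, D)
```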
- TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical Image Segmentation [11.190117191084175]
This paper proposes TransDeepLab, a novel DeepLab-like pure Transformer for medical image segmentation.
We exploit a hierarchical Swin Transformer with shifted windows to extend DeepLabv3 and model the Atrous Spatial Pyramid Pooling (ASPP) module.
Our approach performs better than, or on par with, most contemporary works that amalgamate Vision Transformer and CNN-based methods.
arXiv Detail & Related papers (2022-08-01T09:53:53Z)
- MlTr: Multi-label Classification with Transformer [35.14232810099418]
We propose a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention.
The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE.
arXiv Detail & Related papers (2021-06-11T06:53:09Z)
- Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning [60.75687261314962]
We introduce pixel-level pretext tasks for learning dense feature representations.
A pixel-to-propagation consistency task produces better results than state-of-the-art approaches.
Results demonstrate the strong potential of defining pretext tasks at the pixel level.
arXiv Detail & Related papers (2020-11-19T18:59:45Z)
- Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks [70.0243910593064]
Key to the success of vector quantization is deciding which parameter groups should be compressed together.
In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function.
We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
arXiv Detail & Related papers (2020-10-29T15:47:26Z)
- Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition [98.10703825716142]
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
We present different architectures based on PyConv for four main tasks on visual recognition: image classification, video action classification/recognition, object detection and semantic image segmentation/parsing.
arXiv Detail & Related papers (2020-06-20T10:19:29Z)
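The pyramidal convolution in the last entry can be illustrated with a layer that runs several parallel convolutions with different kernel sizes and concatenates their outputs, so a single layer sees the input at multiple receptive-field scales. This is a simplified sketch: the published PyConv additionally uses grouped convolutions per branch to keep the cost comparable to a standard convolution, and the class and argument names here are illustrative.

```python
import torch
import torch.nn as nn


class PyConv2d(nn.Module):
    """Pyramidal convolution sketch: parallel convolutions with different kernel
    sizes whose outputs are concatenated, so one layer covers several spatial
    scales. Simplified relative to the published PyConv (no grouped convs)."""

    def __init__(self, in_channels, out_channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert out_channels % len(kernel_sizes) == 0
        branch_channels = out_channels // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # Each branch sees the same input at a different receptive-field size.
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```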