SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2311.15537v2
- Date: Tue, 27 Feb 2024 10:59:30 GMT
- Title: SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
- Authors: Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang
- Abstract summary: Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories.
We propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation.
Our SED method achieves mIoU score of 31.6% on ADE20K with 150 categories at 82 millisecond ($ms$) per image on a single A6000.
- Score: 91.91385816767057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary semantic segmentation strives to distinguish pixels into
different semantic groups from an open set of categories. Most existing methods
explore utilizing pre-trained vision-language models, in which the key is to
adopt the image-level model for pixel-level segmentation task. In this paper,
we propose a simple encoder-decoder, named SED, for open-vocabulary semantic
segmentation, which comprises a hierarchical encoder-based cost map generation
and a gradual fusion decoder with category early rejection. The hierarchical
encoder-based cost map generation employs hierarchical backbone, instead of
plain transformer, to predict pixel-level image-text cost map. Compared to
plain transformer, hierarchical backbone better captures local spatial
information and has linear computational complexity with respect to input size.
Our gradual fusion decoder employs a top-down structure to combine cost map and
the feature maps of different backbone levels for segmentation. To accelerate
inference speed, we introduce a category early rejection scheme in the decoder
that rejects many no-existing categories at the early layer of decoder,
resulting in at most 4.7 times acceleration without accuracy degradation.
Experiments are performed on multiple open-vocabulary semantic segmentation
datasets, which demonstrates the efficacy of our SED method. When using
ConvNeXt-B, our SED method achieves mIoU score of 31.6\% on ADE20K with 150
categories at 82 millisecond ($ms$) per image on a single A6000. We will
release it at \url{https://github.com/xb534/SED.git}.
Related papers
- CorrMatch: Label Propagation via Correlation Matching for
Semi-Supervised Semantic Segmentation [73.89509052503222]
This paper presents a simple but performant semi-supervised semantic segmentation approach, called CorrMatch.
We observe that the correlation maps not only enable clustering pixels of the same category easily but also contain good shape information.
We propose to conduct pixel propagation by modeling the pairwise similarities of pixels to spread the high-confidence pixels and dig out more.
Then, we perform region propagation to enhance the pseudo labels with accurate class-agnostic masks extracted from the correlation maps.
arXiv Detail & Related papers (2023-06-07T10:02:29Z) - Generalized Decoding for Pixel, Image, and Language [197.85760901840177]
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly.
X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.
arXiv Detail & Related papers (2022-12-21T18:58:41Z) - Clustering as Attention: Unified Image Segmentation with Hierarchical
Clustering [11.696069523681178]
We propose a hierarchical clustering-based image segmentation scheme for deep neural networks, called HCFormer.
We interpret image segmentation, including semantic, instance, and panoptic segmentation, as a pixel clustering problem, and accomplish it by bottom-up, hierarchical clustering with deep neural networks.
arXiv Detail & Related papers (2022-05-20T03:53:56Z) - Small Lesion Segmentation in Brain MRIs with Subpixel Embedding [105.1223735549524]
We present a method to segment MRI scans of the human brain into ischemic stroke lesion and normal tissues.
We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network.
arXiv Detail & Related papers (2021-09-18T00:21:17Z) - Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z) - Evidential fully convolutional network for semantic segmentation [6.230751621285322]
We propose a hybrid architecture composed of a fully convolutional network (FCN) and a Dempster-Shafer layer for image semantic segmentation.
Experiments show that the proposed combination improves the accuracy and calibration of semantic segmentation by assigning confusing pixels to multi-class sets.
arXiv Detail & Related papers (2021-03-25T01:21:22Z) - A Novel Upsampling and Context Convolution for Image Semantic
Segmentation [0.966840768820136]
Recent methods for semantic segmentation often employ an encoder-decoder structure using deep convolutional neural networks.
We propose a dense upsampling convolution method based on guided filtering to effectively preserve the spatial information of the image in the network.
We report a new record of 82.86% and 81.62% of pixel accuracy on ADE20K and Pascal-Context benchmark datasets, respectively.
arXiv Detail & Related papers (2021-03-20T06:16:42Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective
with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR)
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.