SOTR: Segmenting Objects with Transformers
- URL: http://arxiv.org/abs/2108.06747v2
- Date: Tue, 17 Aug 2021 04:15:21 GMT
- Title: SOTR: Segmenting Objects with Transformers
- Authors: Ruohao Guo, Dantong Niu, Liao Qu, Zhenbo Li
- Abstract summary: We present a novel, flexible, and effective transformer-based model for high-quality instance segmentation.
The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline.
Our SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present a novel, flexible, and effective transformer-based model for high-quality instance segmentation. The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline, building on an alternative CNN backbone appended with two parallel subtasks: (1) predicting per-instance categories via a transformer and (2) dynamically generating segmentation masks with a multi-level upsampling module. SOTR effectively extracts low-level feature representations and captures long-range context dependencies through the Feature Pyramid Network (FPN) and the twin transformer, respectively. Meanwhile, compared with the original transformer, the proposed twin transformer is time- and resource-efficient, since only row and column attention are used to encode pixels. Moreover, SOTR is easy to combine with various CNN backbones and transformer variants, yielding considerable improvements in segmentation accuracy and training convergence. Extensive experiments show that SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches. We hope our simple but strong framework can serve as a preferred baseline for instance-level recognition. Our code is available at https://github.com/easton-cau/SOTR.
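
To make the twin-transformer idea concrete, the sketch below shows row-and-column attention applied to an FPN-style feature map in PyTorch. This is only an illustration of the mechanism described in the abstract, not the authors' implementation (see the repository above for that); the module name, shapes, and hyper-parameters are assumptions.

```python
# Minimal sketch of the row/column ("twin") attention idea from the abstract.
# Illustrative assumption, NOT the official SOTR code.
import torch
import torch.nn as nn


class TwinAttention(nn.Module):
    """Self-attention along rows, then along columns, of an H x W feature map.

    Each attention call only attends over W (or H) positions instead of the
    full H*W token grid, which is the time/memory saving the abstract refers to.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map, e.g. one FPN level.
        b, h, w, c = x.shape

        # Row attention: each of the B*H rows attends over its W positions.
        rows = x.reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c)

        # Column attention: each of the B*W columns attends over its H positions.
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    feat = torch.randn(2, 32, 32, 256)       # toy feature map
    out = TwinAttention(dim=256)(feat)
    print(out.shape)                          # torch.Size([2, 32, 32, 256])
```

For a 32x32 map, each attention call covers 32 tokens rather than 1024, which is why the decomposition scales better than full 2-D self-attention on high-resolution features.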
Related papers
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
Swin Transformer set new records on various vision tasks by using a hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in self and cross attentions in Vision Transformers for the scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z)
- Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting a feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- End-to-End Referring Video Object Segmentation with Multimodal Transformers [0.0]
We propose a simple Transformer-based approach to the referring video object segmentation task.
Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem.
MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps.
arXiv Detail & Related papers (2021-11-29T18:59:32Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel encoder-decoder framework for semantic image segmentation, Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task of simultaneously classifying, segmenting, and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)