SOTR: Segmenting Objects with Transformers
- URL: http://arxiv.org/abs/2108.06747v2
- Date: Tue, 17 Aug 2021 04:15:21 GMT
- Title: SOTR: Segmenting Objects with Transformers
- Authors: Ruohao Guo, Dantong Niu, Liao Qu, Zhenbo Li
- Abstract summary: We present a novel, flexible, and effective transformer-based model for high-quality instance segmentation.
The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline.
Our SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present a novel, flexible, and effective transformer-based model for high-quality instance segmentation. The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline, building on an alternative CNN backbone appended with two parallel subtasks: (1) predicting per-instance categories via a transformer and (2) dynamically generating segmentation masks with a multi-level upsampling module. SOTR effectively extracts low-level feature representations and captures long-range context dependencies through the Feature Pyramid Network (FPN) and the twin transformer, respectively. Meanwhile, compared with the original transformer, the proposed twin transformer is time- and resource-efficient, since only row and column attention are used to encode pixels. Moreover, SOTR is easy to combine with various CNN backbones and transformer variants, yielding considerable improvements in segmentation accuracy and training convergence. Extensive experiments show that SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches. We hope our simple but strong framework can serve as a preferred baseline for instance-level recognition. Our code is available at https://github.com/easton-cau/SOTR.
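
To make the twin-transformer idea concrete, the sketch below shows row-and-column attention applied to an FPN-style feature map in PyTorch. This is only an illustration of the mechanism described in the abstract, not the authors' implementation (see the repository above for that); the module name, shapes, and hyper-parameters are assumptions.

```python
# Minimal sketch of the row/column ("twin") attention idea from the abstract.
# Illustrative assumption, NOT the official SOTR code.
import torch
import torch.nn as nn


class TwinAttention(nn.Module):
    """Self-attention along rows, then along columns, of an H x W feature map.

    Each attention call only attends over W (or H) positions instead of the
    full H*W token grid, which is the time/memory saving the abstract refers to.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map, e.g. one FPN level.
        b, h, w, c = x.shape

        # Row attention: each of the B*H rows attends over its W positions.
        rows = x.reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c)

        # Column attention: each of the B*W columns attends over its H positions.
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    feat = torch.randn(2, 32, 32, 256)       # toy feature map
    out = TwinAttention(dim=256)(feat)
    print(out.shape)                          # torch.Size([2, 32, 32, 256])
```

For a 32x32 map, each attention call covers 32 tokens rather than 1024, which is why the decomposition scales better than full 2-D self-attention on high-resolution features.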
Related papers
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
Swin Transformer set new records on various vision tasks by using a hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in self and cross attentions in Vision Transformers for the scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z)
- Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting a feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- End-to-End Referring Video Object Segmentation with Multimodal Transformers [0.0]
We propose a simple Transformer-based approach to the referring video object segmentation task.
Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem.
MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps.
arXiv Detail & Related papers (2021-11-29T18:59:32Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel encoder-decoder framework for semantic image segmentation, Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task of simultaneously classifying, segmenting, and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)