Representation Separation for Semantic Segmentation with Vision
Transformers
- URL: http://arxiv.org/abs/2212.13764v1
- Date: Wed, 28 Dec 2022 09:54:52 GMT
- Title: Representation Separation for Semantic Segmentation with Vision
Transformers
- Authors: Yuanduo Hong, Huihui Pan, Weichao Sun, Xinghu Yu, and Huijun Gao
- Abstract summary: Vision transformers (ViTs), which encode an image as a sequence of patches, bring new paradigms for semantic segmentation.
We present an efficient framework of representation separation at the local-patch level and the global-region level for semantic segmentation with ViTs.
- Score: 11.431694321563322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs), which encode an image as a sequence of patches,
bring new paradigms for semantic segmentation. We present an efficient framework of
representation separation at the local-patch level and the global-region level for
semantic segmentation with ViTs. It targets the peculiar over-smoothness of ViTs in
semantic segmentation, and therefore differs from current popular paradigms of context
modeling and from most existing related methods that reinforce the advantage of
attention. We first deliver a decoupled two-pathway network in which a second pathway
enhances and passes down local-patch discrepancy complementary to the global
representations of transformers. We then propose a spatially adaptive separation
module to obtain more separated deep representations, and a discriminative
cross-attention that yields more discriminative region representations through
novel auxiliary supervisions. The proposed methods achieve impressive
results: 1) incorporated with large-scale plain ViTs, our methods achieve new
state-of-the-art performances on five widely used benchmarks; 2) using masked
pre-trained plain ViTs, we achieve 68.9% mIoU on Pascal Context, setting a new
record; 3) pyramid ViTs integrated with the decoupled two-pathway network even
surpass the well-designed high-resolution ViTs on Cityscapes; 4) the improved
representations by our framework have favorable transferability in images with
natural corruptions. The code will be released publicly.
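As a rough illustration of the decoupled two-pathway idea described in the abstract, here is a minimal PyTorch sketch of a decoder that keeps a global pathway alongside a local-patch-discrepancy pathway and fuses them for per-pixel prediction. All module names, channel widths, and the discrepancy operator (feature minus a local average) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TwoPathwayHead(nn.Module):
    """Illustrative sketch of a decoupled two-pathway decoder: one pathway keeps
    the global ViT representation, the other carries local-patch discrepancy
    (approximated here as the residual between each patch feature and a smoothed
    neighbourhood average) before the two are fused for prediction."""

    def __init__(self, in_dim=768, num_classes=150):
        super().__init__()
        self.global_proj = nn.Conv2d(in_dim, 256, kernel_size=1)
        self.local_proj = nn.Conv2d(in_dim, 256, kernel_size=1)
        self.smooth = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, feat):                    # feat: (B, C, H, W) ViT patch features
        g = self.global_proj(feat)              # global pathway
        local_disc = feat - self.smooth(feat)   # local-patch discrepancy (assumed form)
        l = self.local_proj(local_disc)         # local pathway
        fused = torch.cat([g, l], dim=1)        # pass discrepancy down alongside global cues
        return self.classifier(fused)           # per-pixel class logits at patch resolution

if __name__ == "__main__":
    x = torch.randn(2, 768, 32, 32)             # e.g. 512x512 input with 16x16 patches
    print(TwoPathwayHead()(x).shape)            # torch.Size([2, 150, 32, 32])
```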
Related papers
- Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which allows for the utilization of hierarchical features.
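To make the "three 3$\times$3 convolutions" claim concrete, the following is a minimal sketch of a plain segmentation head over ViT patch features; the channel widths, activations, and upsampling strategy are assumptions and may differ from the actual PlainSeg design.

```python
import torch.nn as nn
import torch.nn.functional as F

class MinimalPlainHead(nn.Module):
    """Sketch of a minimalist segmentation head with only three 3x3 convolutions
    on top of plain ViT patch features, in the spirit of PlainSeg."""

    def __init__(self, in_dim=768, hidden=256, num_classes=150):
        super().__init__()
        self.conv1 = nn.Conv2d(in_dim, hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.conv3 = nn.Conv2d(hidden, num_classes, 3, padding=1)

    def forward(self, feat, out_size):          # feat: (B, C, H, W) patch features
        x = F.relu(self.conv1(feat))
        x = F.relu(self.conv2(x))
        x = self.conv3(x)
        # Upsample patch-level logits to the input resolution.
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
```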
arXiv Detail & Related papers (2023-10-19T14:01:40Z)
- Dual-Augmented Transformer Network for Weakly Supervised Semantic Segmentation [4.02487511510606]
Weakly supervised semantic segmentation (WSSS) is a fundamental computer vision task that aims to segment objects using only class-level labels.
Traditional methods adopt CNN-based networks and utilize the class activation map (CAM) strategy to discover object regions.
An alternative is to explore vision transformers (ViTs), which encode the image to acquire global semantic information.
We propose a dual network with both CNN-based and transformer networks for mutually complementary learning.
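For context on the CAM strategy mentioned above, here is a minimal sketch of computing a class activation map from convolutional features and classifier weights; it illustrates only the standard region-seeding step, not the dual-network interaction proposed in that paper.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Minimal CAM computation for weakly supervised segmentation seeds.

    features  : (B, C, H, W) convolutional features before global pooling
    fc_weight : (num_classes, C) weights of the classification layer
    class_idx : index of the class whose activation map is wanted

    Returns a (B, H, W) map normalised to [0, 1]; thresholding it gives the
    initial object regions that WSSS pipelines then refine.
    """
    w = fc_weight[class_idx].view(1, -1, 1, 1)              # (1, C, 1, 1)
    cam = F.relu((features * w).sum(dim=1))                 # weighted sum over channels
    cam_min = cam.flatten(1).min(dim=1)[0].view(-1, 1, 1)
    cam_max = cam.flatten(1).max(dim=1)[0].view(-1, 1, 1)
    return (cam - cam_min) / (cam_max - cam_min + 1e-6)     # per-image min-max normalisation
```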
arXiv Detail & Related papers (2023-09-30T08:41:11Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and vision transformers (ViTs) provide the main architectural models for semantic segmentation.
While ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z)
- Siamese DETR [87.45960774877798]
We present Siamese DETR, a self-supervised pretraining approach for the Transformer architecture in DETR.
We consider learning view-invariant and detection-oriented representations simultaneously through two complementary tasks.
The proposed Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection.
arXiv Detail & Related papers (2023-03-31T15:29:25Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that, by learning global context at a full receptive field layer by layer, ViTs may capture stronger long-range dependencies.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
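The sketch below illustrates the local/global split described for the HLG Transformers: standard multi-head self-attention within non-overlapping windows, followed by global attention across per-window summary tokens. The window size, mean-pooling of window tokens, and fusion by broadcast addition are assumptions, not the HLG implementation.

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Local attention inside windows plus global attention across window tokens."""

    def __init__(self, dim=256, heads=8, window=7):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C), H and W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # Local attention inside each w x w window.
        win = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(B * (H // w) * (W // w), w * w, C)
        win, _ = self.local_attn(win, win, win)
        # Global attention across windows, using mean-pooled window tokens.
        tokens = win.mean(dim=1).view(B, (H // w) * (W // w), C)
        tokens, _ = self.global_attn(tokens, tokens, tokens)
        # Broadcast each refined window token back into its window (assumed fusion).
        win = win + tokens.view(-1, 1, C)
        win = win.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return win.reshape(B, H, W, C)
```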
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation [48.7190017311309]
We find that straightforwardly applying local ViTs in domain adaptive semantic segmentation does not bring the expected improvement.
The pitfall lies in high-frequency components, which make the training of local ViTs very unsmooth and hurt their transferability.
In this paper, we introduce a low-pass filtering mechanism, momentum network, to smooth the learning dynamics of target domain features and pseudo labels.
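As a rough illustration of the momentum-network idea, here is a sketch of an exponential-moving-average (EMA) teacher that low-pass filters the student's weights, which is one common way to smooth the learning dynamics used for target-domain pseudo labels; the momentum coefficient and the exact usage are assumptions, not that paper's scheme.

```python
import copy
import torch

def build_teacher(student):
    """Initialise the momentum (teacher) network as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def update_momentum_network(student, teacher, momentum=0.999):
    """EMA update: the teacher is a low-pass-filtered copy of the student.
    The momentum coefficient 0.999 is an assumed value."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```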
arXiv Detail & Related papers (2022-03-15T15:20:30Z)
- HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones.
We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains.
Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.