Vision Transformers: From Semantic Segmentation to Dense Prediction
- URL: http://arxiv.org/abs/2207.09339v3
- Date: Thu, 12 Oct 2023 09:13:37 GMT
- Title: Vision Transformers: From Semantic Segmentation to Dense Prediction
- Authors: Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei
Fu, Tao Xiang, Jianfeng Feng, Philip H.S. Torr
- Abstract summary: Vision transformers (ViTs) in image classification have shifted the methodologies for visual representation learning.
In this work, we explore the global context learning potentials of ViTs for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
- Score: 144.38869017091199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of vision transformers (ViTs) in image classification has
shifted the methodologies for visual representation learning. In particular,
ViTs learn visual representation at full receptive field per layer across all
the image patches, in comparison to the increasing receptive fields of CNNs
across layers and other alternatives (e.g., large kernels and atrous
convolution). In this work, for the first time we explore the global context
learning potentials of ViTs for dense visual prediction (e.g., semantic
segmentation). Our motivation is that through learning global context at full
receptive field layer by layer, ViTs may capture stronger long-range dependency
information, critical for dense prediction tasks. We first demonstrate that,
by encoding an image as a sequence of patches, a vanilla ViT without local
convolution or resolution reduction can yield a stronger visual representation
for semantic segmentation. For example, our model, termed SEgmentation
TRansformer (SETR), excels on ADE20K (50.28% mIoU, ranking first on the test
leaderboard on the day of submission) and Pascal Context (55.83% mIoU),
and performs competitively on Cityscapes. For tackling general dense visual
prediction tasks in a cost-effective manner, we further formulate a family of
Hierarchical Local-Global (HLG) Transformers, characterized by local attention
within windows and global attention across windows in a pyramidal architecture.
Extensive experiments show that our methods achieve appealing performance on a
variety of dense prediction tasks (e.g., object detection, instance
segmentation, and semantic segmentation) as well as image classification. Our
code and models are available at https://github.com/fudan-zvg/SETR.
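To make the SETR idea concrete, below is a minimal, hedged PyTorch sketch, not the released implementation at the repository above: the image is encoded as a sequence of patches, every Transformer layer attends over all patches at constant resolution, and a naive decoder produces per-pixel class logits with a 1x1 convolution followed by bilinear upsampling. The class name ViTSegSketch and all hyperparameters are placeholders chosen for illustration.

```python
# Minimal sketch of a SETR-style pipeline (illustrative only): patches in,
# plain Transformer encoder at constant resolution, naive per-pixel decoder out.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViTSegSketch(nn.Module):
    """Hypothetical name; hyperparameters are placeholders."""

    def __init__(self, img_size=512, patch_size=16, dim=256, depth=4,
                 heads=8, num_classes=150):
        super().__init__()
        grid = img_size // patch_size                  # patches per side
        # Patch embedding: a strided conv equals flattening each patch and
        # applying a shared linear projection.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid, dim))
        # Plain encoder: every layer attends over all patches, i.e. a full
        # receptive field at every layer, with no resolution reduction.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Naive decoder: 1x1 conv to class logits, then bilinear upsampling.
        self.classifier = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):                              # x: (B, 3, H, W)
        B, _, H, W = x.shape
        tokens = self.patch_embed(x)                   # (B, dim, H/16, W/16)
        h, w = tokens.shape[-2:]
        tokens = tokens.flatten(2).transpose(1, 2)     # (B, N, dim) patch sequence
        tokens = self.encoder(tokens + self.pos_embed[:, : h * w])
        feats = tokens.transpose(1, 2).reshape(B, -1, h, w)
        logits = self.classifier(feats)                # (B, num_classes, h, w)
        return F.interpolate(logits, size=(H, W), mode="bilinear",
                             align_corners=False)      # per-pixel class logits


if __name__ == "__main__":
    out = ViTSegSketch()(torch.randn(2, 3, 512, 512))
    print(out.shape)  # torch.Size([2, 150, 512, 512])
```

Unlike a typical CNN backbone, the token grid never drops below H/16 x W/16, which is what lets every layer keep a full receptive field over the image. The HLG variants described in the abstract instead restrict attention to local windows and add global attention across windows in a pyramid; that is not shown here.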
Related papers
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks with only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
- Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and vision transformers (ViTs) provide the architectural models for semantic segmentation.
Although ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks (a rough sketch of this kind of pretext task appears after this list).
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation [11.237929167356725]
We propose an efficient hybrid Transformer (EHT) for semantic segmentation of urban scene images.
EHT takes advantage of CNNs and Transformers, learning global-local context to strengthen the feature representation.
The proposed EHT achieves a 67.0% mIoU on the UAVid test set and outperforms other lightweight models significantly.
arXiv Detail & Related papers (2021-09-18T13:55:38Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve performance comparable or even superior to convolutional networks on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- BEV-Seg: Bird's Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud [21.29622194272066]
We focus on bird's-eye-view (BEV) semantic segmentation, a task that predicts pixel-wise semantic labels in BEV from side RGB images.
There are two main challenges to this task: the view transformation from side view to bird's eye view, as well as transfer learning to unseen domains.
Our novel 2-staged perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner.
arXiv Detail & Related papers (2020-06-19T23:30:11Z)
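As a rough illustration of the relative-location pretext task summarized in the Location-Aware Self-Supervised Transformers entry above (a hedged sketch under assumed shapes and names, not that paper's implementation): a query patch feature cross-attends to reference patch features, a random subset of which is hidden to control difficulty, and a linear head classifies which grid position the query came from.

```python
# Hedged sketch of a relative-location pretext task: a query patch attends to
# (partially masked) reference patch features and predicts its grid position.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativeLocationHead(nn.Module):
    """Hypothetical module; names and dimensions are illustrative."""

    def __init__(self, dim=256, num_positions=196, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_positions)

    def forward(self, query, refs, hidden):
        # query: (B, 1, D); refs: (B, N, D); hidden: (B, N) bool, True = masked out.
        attended, _ = self.cross_attn(query, refs, refs, key_padding_mask=hidden)
        return self.classifier(attended.squeeze(1))    # (B, N) position logits


if __name__ == "__main__":
    B, N, D, mask_ratio = 4, 196, 256, 0.5
    refs = torch.randn(B, N, D)                        # reference patch features
    target = torch.randint(0, N, (B,))                 # true grid position of the query
    # Toy query = the patch at the target position; in practice the query
    # would come from a separate view of the image.
    query = refs[torch.arange(B), target].unsqueeze(1)
    hidden = torch.rand(B, N) > (1 - mask_ratio)       # hide ~half of the references
    head = RelativeLocationHead(dim=D, num_positions=N)
    loss = F.cross_entropy(head(query, refs, hidden), target)
    print(float(loss))
```

Masking more of the reference features forces the prediction to rely on surrounding context rather than direct matching, which is the difficulty control the summary above refers to.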