ResFormer: Scaling ViTs with Multi-Resolution Training
- URL: http://arxiv.org/abs/2212.00776v2
- Date: Mon, 3 Apr 2023 06:55:09 GMT
- Title: ResFormer: Scaling ViTs with Multi-Resolution Training
- Authors: Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, Yu-Gang Jiang
- Abstract summary: We introduce ResFormer, a framework for improved performance on a wide spectrum of, mostly unseen, testing resolutions.
In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales.
We also demonstrate that ResFormer is flexible and can be easily extended to semantic segmentation, object detection and video action recognition.
- Score: 100.01406895070693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have achieved overwhelming success, yet they
suffer from vulnerable resolution scalability, i.e., the performance drops
drastically when presented with input resolutions that are unseen during
training. We introduce ResFormer, a framework that is built upon the seminal
idea of multi-resolution training for improved performance on a wide spectrum
of, mostly unseen, testing resolutions. In particular, ResFormer operates on
replicated images of different resolutions and enforces a scale consistency
loss to engage interactive information across different scales. More
importantly, to alternate among varying resolutions effectively, especially
novel ones in testing, we propose a global-local positional embedding strategy
that changes smoothly conditioned on input sizes. We conduct extensive
experiments for image classification on ImageNet. The results provide strong
quantitative evidence that ResFormer has promising scaling abilities towards a
wide range of resolutions. For instance, ResFormer-B-MR achieves a Top-1
accuracy of 75.86% and 81.72% when evaluated on relatively low and high
resolutions respectively (i.e., 96 and 640), which are 48% and 7.49% better
than DeiT-B. We also demonstrate that ResFormer is flexible and can be
easily extended to semantic segmentation, object detection and video action
recognition. Code is available at https://github.com/ruitian12/resformer.
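The multi-resolution training idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_model` is a hypothetical stand-in for a ViT, the nearest-neighbour `downsample` is an assumed resizing choice, and using a smooth L1 penalty toward the highest-resolution logits is only one plausible way to express a scale consistency term. It assumes a square input image given as a list of rows.

```python
import math


def downsample(image, size):
    """Nearest-neighbour resize of a square 2D list to size x size (assumed resizing)."""
    n = len(image)
    return [[image[(i * n) // size][(j * n) // size] for j in range(size)]
            for i in range(size)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    return -math.log(probs[label])


def smooth_l1(a, b, beta=1.0):
    """Mean smooth L1 distance between two equally long vectors."""
    total = 0.0
    for x, y in zip(a, b):
        d = abs(x - y)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(a)


def toy_model(image):
    """Hypothetical stand-in for a ViT: logits from resolution-agnostic pixel statistics."""
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    var = sum((p - mean) ** 2 for p in flat) / len(flat)
    return [mean, var, mean - var]


def multi_resolution_loss(image, label, sizes, weight=1.0):
    """Classification loss averaged over replicated resolutions, plus a consistency term.

    The highest resolution acts as the reference; in a real training loop its
    output would typically be excluded from gradient computation (assumption).
    """
    sizes = sorted(sizes, reverse=True)
    logits = [toy_model(downsample(image, s)) for s in sizes]
    ce = sum(cross_entropy(softmax(z), label) for z in logits) / len(sizes)
    reference = logits[0]
    others = logits[1:]
    consistency = sum(smooth_l1(z, reference) for z in others) / max(len(others), 1)
    return ce + weight * consistency, ce, consistency


if __name__ == "__main__":
    # An 8x8 gradient image replicated at three resolutions.
    img = [[(i * 8 + j) / 64 for j in range(8)] for i in range(8)]
    total, ce, consistency = multi_resolution_loss(img, label=1, sizes=[8, 4, 2])
    print(total, ce, consistency)
```

The consistency term is zero when every resolution yields the same logits and grows as the predictions at different scales drift apart, which is the behaviour a scale consistency loss is meant to penalize.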
Related papers
- ViTAR: Vision Transformer with Any Resolution [80.95324692984903]
Vision Transformers experience a performance decline when processing resolutions different from those seen during training.
We introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions.
Our resulting model, ViTAR, demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution.
arXiv Detail & Related papers (2024-03-27T08:53:13Z)
- Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection [68.65338791283298]
Salient Object Detection (SOD) aims to identify and segment the most conspicuous objects in an image or video.
Traditional SOD methods are largely limited to low-resolution images, making them difficult to adapt to the development of High-Resolution SOD.
In this work, we first propose a new HRS10K dataset, which contains 10,500 high-quality annotated images at 2K-8K resolution.
arXiv Detail & Related papers (2023-08-07T17:49:04Z)
- Improving Performance of Object Detection using the Mechanisms of Visual Recognition in Humans [0.4297070083645048]
We first track the performance of the state-of-the-art deep object recognition network, Faster-RCNN, as a function of image resolution.
They also show that different spatial frequencies convey different information about the objects in the recognition process.
We propose a multi-resolution object recognition framework rather than a single-resolution network.
arXiv Detail & Related papers (2023-01-23T19:09:36Z)
- Learning Resolution-Adaptive Representations for Cross-Resolution Person Re-Identification [49.57112924976762]
The cross-resolution person re-identification problem aims to match low-resolution (LR) query identity images against high-resolution (HR) gallery images.
It is a challenging and practical problem since the query images often suffer from resolution degradation due to the different capturing conditions from real-world cameras.
This paper explores an alternative SR-free paradigm to directly compare HR and LR images via a dynamic metric, which is adaptive to the resolution of a query image.
arXiv Detail & Related papers (2022-07-09T03:49:51Z)
- Resolution based Feature Distillation for Cross Resolution Person Re-Identification [17.86505685442293]
Person re-identification (re-id) aims to retrieve images of the same identities across different camera views.
Resolution mismatch occurs due to varying distances between the person of interest and the cameras.
We propose a Resolution based Feature Distillation (RFD) approach to overcome the problem of multiple resolutions.
arXiv Detail & Related papers (2021-09-16T11:07:59Z)
- Resolution-invariant Person ReID Based on Feature Transformation and Self-weighted Attention [14.777001614779806]
Person Re-identification (ReID) is a critical computer vision task which aims to match the same person in images or video sequences.
We propose a novel two-stream network with a lightweight resolution association ReID feature transformation (RAFT) module and a self-weighted attention (SWA) ReID module.
Both modules are jointly trained to get a resolution-invariant representation.
arXiv Detail & Related papers (2021-01-12T15:22:41Z)
- Resolution Switchable Networks for Runtime Efficient Image Recognition [46.09537029831355]
We propose a general method to train a single convolutional neural network which is capable of switching image resolutions at inference.
Networks trained with the proposed method are named Resolution Switchable Networks (RS-Nets).
arXiv Detail & Related papers (2020-07-19T02:12:59Z)
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for the image restoration task.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
- Cross-Resolution Adversarial Dual Network for Person Re-Identification and Beyond [59.149653740463435]
Person re-identification (re-ID) aims at matching images of the same person across camera views.
Due to varying distances between cameras and persons of interest, resolution mismatch can be expected.
We propose a novel generative adversarial network to address cross-resolution person re-ID.
arXiv Detail & Related papers (2020-02-19T07:21:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.