ResT V2: Simpler, Faster and Stronger
- URL: http://arxiv.org/abs/2204.07366v1
- Date: Fri, 15 Apr 2022 07:57:40 GMT
- Title: ResT V2: Simpler, Faster and Stronger
- Authors: Qing-Long Zhang and Yu-Bin Yang
- Abstract summary: This paper proposes ResTv2, a simpler, faster, and stronger multi-scale vision Transformer for visual recognition.
We validate ResTv2 on ImageNet classification, COCO detection, and ADE20K semantic segmentation.
Experimental results show that the proposed ResTv2 outperforms recent state-of-the-art backbones by a large margin.
- Score: 18.610152288982288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes ResTv2, a simpler, faster, and stronger multi-scale
vision Transformer for visual recognition. ResTv2 simplifies the EMSA structure
in ResTv1 (i.e., eliminating the multi-head interaction part) and employs an
upsample operation to reconstruct the lost medium- and high-frequency
information caused by the downsampling operation. In addition, we explore
different techniques to better apply ResTv2 backbones to downstream tasks. We
found that although combining EMSAv2 and window attention can greatly reduce
the theoretical matrix multiply FLOPs, it may significantly decrease the
computation density, thus causing lower actual speed. We comprehensively
validate ResTv2 on ImageNet classification, COCO detection, and ADE20K semantic
segmentation. Experimental results show that the proposed ResTv2 outperforms
recent state-of-the-art backbones by a large margin, demonstrating the
potential of ResTv2 as a solid backbone. The code and models will be made
publicly available at \url{https://github.com/wofmanaf/ResT}
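
For readers who want a concrete picture, below is a minimal PyTorch-style sketch of an EMSAv2-like attention block, reconstructed from the description above. It is an assumption-laden illustration, not the authors' implementation: the strided depthwise-conv downsampling of keys/values follows ResTv1's EMSA, and the pixel-shuffle branch is one plausible realization of the "upsample operation" that restores the lost medium- and high-frequency content; the official repository linked above has the exact design.

import time

import torch
import torch.nn as nn


class EMSAv2(nn.Module):
    # Sketch of EMSAv2: ResTv1's EMSA with the multi-head interaction
    # module removed, plus an upsample branch that reconstructs the
    # medium/high-frequency content lost when K/V are downsampled.
    # (Downsample/upsample choices here are assumptions, not the paper's.)
    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.sr_ratio = sr_ratio
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        if sr_ratio > 1:
            # Strided depthwise conv shrinks the K/V token map (as in ResTv1).
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio + 1,
                                stride=sr_ratio, padding=sr_ratio // 2,
                                groups=dim)
            self.sr_norm = nn.LayerNorm(dim)
            # Upsample branch (assumed pixel-shuffle) restores the lost
            # resolution; its output is added back to the attention output.
            self.up = nn.Sequential(
                nn.Conv2d(dim, sr_ratio * sr_ratio * dim, kernel_size=3,
                          padding=1, groups=dim),
                nn.PixelShuffle(sr_ratio))
            self.up_norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # token sequence, N == H * W
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x2d = x.transpose(1, 2).reshape(B, C, H, W)
            x_sr = self.sr(x2d)                            # (B, C, H/s, W/s)
            up = self.up(x_sr).flatten(2).transpose(1, 2)  # back to (B, N, C)
            up = self.up_norm(up)
            x_kv = self.sr_norm(x_sr.flatten(2).transpose(1, 2))
        else:
            up = 0
            x_kv = x
        kv = self.kv(x_kv).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4).unbind(0)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)  # note: no multi-head interaction conv (removed in v2)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out + up)


# Quick wall-clock check. This reflects the abstract's caveat: lower
# theoretical matrix-multiply FLOPs (e.g., from adding window attention)
# can reduce computation density on real hardware, so measured throughput,
# not FLOP counts, should decide design choices.
block = EMSAv2(dim=96, num_heads=3, sr_ratio=2).eval()
x = torch.randn(8, 56 * 56, 96)
with torch.no_grad():
    block(x, 56, 56)  # warm-up
    t0 = time.time()
    for _ in range(10):
        block(x, 56, 56)
print(f"avg latency: {(time.time() - t0) / 10 * 1e3:.1f} ms")
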
Related papers
- Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization [65.33914980022303]
Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content.
Most methods can only train on pre-extracted features without optimizing them for the localization problem.
We propose Re2TAL, a novel end-to-end method that rewires pretrained video backbones for reversible TAL.
arXiv Detail & Related papers (2022-11-25T12:17:30Z)
- BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance [54.214426436283134]
Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications.
We present BiFSMNv2, a strong yet efficient binary neural network for KWS that reaches real-network accuracy.
Benefiting from its compact architecture and optimized hardware kernel, BiFSMNv2 achieves an impressive 25.1x speedup and 20.2x storage saving on edge hardware.
arXiv Detail & Related papers (2022-11-13T18:31:45Z)
- Asymmetric Learned Image Compression with Multi-Scale Residual Block, Importance Map, and Post-Quantization Filtering [15.056672221375104]
Deep learning-based image compression has achieved better rate-distortion (R-D) performance than the latest traditional method, H.266/VVC.
However, many leading learned schemes cannot maintain a good trade-off between performance and complexity.
We propose an efficient and effective image coding framework that achieves R-D performance similar to the state of the art with lower complexity.
arXiv Detail & Related papers (2022-06-21T09:34:29Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
The Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
However, ViT performance saturates quickly as depth increases, due to observed attention collapse or patch uniformity.
We propose two techniques to mitigate this undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- Two-Stage is Enough: A Concise Deep Unfolding Reconstruction Network for Flexible Video Compressive Sensing [7.154417066884072]
We show that a 2-stage deep unfolding network can achieve state-of-the-art (SOTA) results in video compressive sensing (VCS).
We extend the proposed model to color VCS, performing joint reconstruction and demosaicing.
Our network is also flexible with respect to mask modulation and scale size for color VCS reconstruction, so a single trained network can be applied to different hardware systems.
arXiv Detail & Related papers (2022-01-15T09:40:22Z)
- FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of quantized Transformers.
We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
- Optical-Flow-Reuse-Based Bidirectional Recurrent Network for Space-Time Video Super-Resolution [52.899234731501075]
Space-time video super-resolution (ST-VSR) simultaneously increases the spatial resolution and frame rate of a given video.
Existing methods typically struggle to efficiently leverage information from a large range of neighboring frames.
We propose a coarse-to-fine bidirectional recurrent neural network, instead of ConvLSTM, to leverage knowledge between adjacent frames.
arXiv Detail & Related papers (2021-10-13T15:21:30Z)
- ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents ResT, an efficient multi-scale vision Transformer that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z)
- Replay and Synthetic Speech Detection with Res2Net Architecture [85.20912636149552]
Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks.
This work proposes to leverage a novel model structure, Res2Net, to improve the generalizability of anti-spoofing countermeasures.
arXiv Detail & Related papers (2020-10-28T14:33:42Z)
- MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frame information are the key sources for exploiting temporal and spatial cues.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)