SideRT: A Real-time Pure Transformer Architecture for Single Image Depth
Estimation
- URL: http://arxiv.org/abs/2204.13892v1
- Date: Fri, 29 Apr 2022 05:46:20 GMT
- Title: SideRT: A Real-time Pure Transformer Architecture for Single Image Depth
Estimation
- Authors: Chang Shu, Ziming Chen, Lei Chen, Kuan Ma, Minghui Wang and Haibing
Ren
- Abstract summary: We propose a pure transformer architecture called SideRT that delivers excellent predictions in real time.
This is the first work to show that transformer-based networks can attain state-of-the-art performance in real time for single image depth estimation.
- Score: 11.513054537848227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since context modeling is critical for estimating depth from a single image,
researchers have put tremendous effort into obtaining global context. Many global operations have been designed for traditional CNN-based architectures to overcome the locality of convolutions. Attention mechanisms and transformers, originally designed to capture long-range dependencies, might be a better choice, but they usually complicate architectures and can reduce inference speed. In this work, we propose a pure transformer architecture called SideRT
that delivers excellent predictions in real time. To better capture global context, Cross-Scale Attention (CSA) and Multi-Scale Refinement (MSR) modules are designed to work collaboratively, fusing features of different scales efficiently. CSA modules focus on fusing features with high semantic similarity, while MSR modules aim to fuse features at corresponding positions. These two modules contain few learnable parameters and no convolutions, and a lightweight yet effective model is built on top of them. This architecture achieves state-of-the-art performance in real time (51.3 FPS) and runs much faster, with a reasonable performance drop, on the smaller Swin-T backbone (83.1 FPS). Furthermore, it surpasses the previous state of the art by a large margin, improving the AbsRel metric by 6.9% on KITTI and 9.7% on NYU. To the best of our knowledge, this is the first work to show that
transformer-based networks can attain state-of-the-art performance in real time for single image depth estimation. Code will be made available soon.
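Code for SideRT has not been released ("Code will be made available soon"), so the following minimal PyTorch sketch is only a rough illustration of the cross-scale attention idea described in the abstract: tokens from one scale attend to tokens from another scale through plain multi-head attention with a residual add and no convolutions. The class name, shapes, and head count are assumptions for illustration, not the authors' CSA module.

```python
# Illustrative cross-scale attention: tokens from a fine (high-resolution)
# stage attend to tokens from a coarse (low-resolution) stage.
# Hypothetical sketch, not the released SideRT code.
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, N_fine, C) tokens from a high-resolution stage
        # coarse: (B, N_coarse, C) tokens from a low-resolution stage
        q = self.norm_q(fine)
        kv = self.norm_kv(coarse)
        out, _ = self.attn(q, kv, kv)  # fine tokens query coarse tokens
        return fine + out              # residual fusion, convolution-free

# Toy usage: fuse a 56x56 stage with a 7x7 stage at embedding width 96.
csa = CrossScaleAttention(dim=96)
fine = torch.randn(2, 56 * 56, 96)
coarse = torch.randn(2, 7 * 7, 96)
fused = csa(fine, coarse)  # shape (2, 3136, 96)
```

Because the fusion is just attention plus a residual add, its only learnable parameters are the attention projections and norms, which is consistent with the abstract's claim of lightweight, convolution-free fusion modules.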
Related papers
- ContextFormer: Redefining Efficiency in Semantic Segmentation [46.06496660333768]
Convolutional methods capture local dependencies well but struggle with long-range relationships.
Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.
We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z)
- HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation [11.334990474402915]
We introduce HAFormer, a model that combines the hierarchical features extraction ability of CNNs with the global dependency modeling capability of Transformers.
HAFormer achieves high performance with minimal computational overhead and compact model size.
arXiv Detail & Related papers (2024-07-10T07:53:24Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid of convolutions and transformers.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses the Continuous Wavelet Transform (CWT) to represent information in 2D tensor form (see the CWT sketch after this list).
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on a Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- CompletionFormer: Depth Completion with Convolutions and Vision Transformers [0.0]
This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
Our CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 the FLOPs) compared to pure Transformer-based methods.
arXiv Detail & Related papers (2023-04-25T17:59:47Z)
- Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation [9.967643080731683]
We investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture Lite-Mono.
The full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
arXiv Detail & Related papers (2022-11-23T18:43:41Z)
- Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INNs) can significantly increase upscaling accuracy by jointly optimizing the downscaling and upscaling cycle.
In this work, a simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling with a single trained model.
It is shown to achieve state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs (a coupling-layer sketch of the invertibility mechanism appears after this list).
arXiv Detail & Related papers (2022-09-26T22:22:30Z)
- Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
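For the TCCT-Net entry above, the sketch below illustrates the general "TC" idea of using a Continuous Wavelet Transform to turn a 1-D signal into a 2-D time-frequency tensor. The toy signal, sampling rate, scale range, and Morlet wavelet are assumptions for illustration, not the paper's settings.

```python
# Turn a 1-D signal into a 2-D (scales x time) tensor with a CWT.
# Illustrative sketch using PyWavelets; not the TCCT-Net code.
import numpy as np
import pywt

fs = 128                                   # assumed sampling rate (Hz)
t = np.arange(0, 4, 1 / fs)                # 4 s of signal, 512 samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

scales = np.arange(1, 65)                  # 64 scales -> 64 frequency rows
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
scalogram = np.abs(coeffs)                 # 2-D tensor of shape (64, 512)
print(scalogram.shape, freqs.min(), freqs.max())
```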
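For the Effective Invertible Arbitrary Image Rescaling entry, invertible networks are typically built from coupling layers whose inverse exists in closed form; this is what lets a single model run the downscaling-upscaling cycle exactly in both directions. The sketch below is a generic RealNVP-style affine coupling layer illustrating that mechanism, not IARN's actual blocks or its handling of arbitrary scale factors.

```python
# Generic affine coupling layer: the forward map has an exact closed-form
# inverse, the standard building block of invertible neural networks.
# Hypothetical illustration; not the IARN implementation.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv2d(half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * half, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t     # transform one half given the other
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)  # exact closed-form inverse
        return torch.cat([y1, x2], dim=1)

layer = AffineCoupling(channels=4)
x = torch.randn(1, 4, 8, 8)
assert torch.allclose(layer.inverse(layer(x)), x, atol=1e-5)
```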