Related papers: Depth Estimation with Simplified Transformer

Depth Estimation with Simplified Transformer

URL: http://arxiv.org/abs/2204.13791v1
Date: Thu, 28 Apr 2022 21:39:00 GMT
Title: Depth Estimation with Simplified Transformer
Authors: John Yang, Le An, Anurag Dixit, Jinkyu Koo, Su Inn Park
Abstract summary: Transformer and its variants have shown state-of-the-art results in many vision tasks recently. We propose a method for self-supervised monocular Depth Estimation with simplified Transformer (DEST) Our model leads to significant reduction in model size, complexity, as well as inference latency, while achieving superior accuracy as compared to state-of-the-art.
Score: 4.565830918989131
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer and its variants have shown state-of-the-art results in many vision tasks recently, ranging from image classification to dense prediction. Despite of their success, limited work has been reported on improving the model efficiency for deployment in latency-critical applications, such as autonomous driving and robotic navigation. In this paper, we aim at improving upon the existing transformers in vision, and propose a method for self-supervised monocular Depth Estimation with Simplified Transformer (DEST), which is efficient and particularly suitable for deployment on GPU-based platforms. Through strategic design choices, our model leads to significant reduction in model size, complexity, as well as inference latency, while achieving superior accuracy as compared to state-of-the-art. We also show that our design generalize well to other dense prediction task without bells and whistles.

Related papers

Efficient Partitioning Vision Transformer on Edge Devices for Distributed Inference [13.533267828812455]
We propose a novel framework, ED-ViT, which is designed to efficiently split and execute complex Vision Transformers across multiple edge devices.<n>Our approach involves partitioning Vision Transformer models into several sub-models, while each dedicated to handling a specific subset of data classes.<n>We demonstrate that our method significantly cuts down inference latency on edge devices and achieves a reduction in model size by up to 28.9 times and 34.1 times, respectively.
arXiv Detail & Related papers (2024-10-15T14:38:14Z)
Research on Personalized Compression Algorithm for Pre-trained Models Based on Homomorphic Entropy Increase [2.6513322539118582]
We explore the challenges and evolution of two key technologies in the current field of AI: Vision Transformer model and Large Language Model (LLM) Vision Transformer captures global information by splitting images into small pieces, but its high reference count and compute overhead limit deployment on mobile devices. LLM has revolutionized natural language processing, but it also faces huge deployment challenges.
arXiv Detail & Related papers (2024-08-16T11:56:49Z)
Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution [6.367865391518726]
Transformer-based models have achieved remarkable results in low-level vision tasks including image super-resolution (SR) To activate more input pixels globally, hybrid attention models have been proposed. We employ wavelet losses to train Transformer models to improve quantitative and subjective performance.
arXiv Detail & Related papers (2024-04-17T11:25:19Z)
E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation. Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
Patch Similarity Aware Data-Free Quantization for Vision Transformers [2.954890575035673]
We propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers. We analyze the self-attention module's properties and reveal a general difference (patch similarity) in its processing of Gaussian noise and real images. Experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT.
arXiv Detail & Related papers (2022-03-04T11:47:20Z)
Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyper- parameter tuning. We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z)
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use. Our method obtains more than 2x improvement on efficiency compared to state-of-the-art vision transformers with only 0.8% drop of accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models. We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel on the original shortcuts. Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the Vision-friendly Transformer' With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.