Lite-Mono: A Lightweight CNN and Transformer Architecture for
Self-Supervised Monocular Depth Estimation
- URL: http://arxiv.org/abs/2211.13202v1
- Date: Wed, 23 Nov 2022 18:43:41 GMT
- Title: Lite-Mono: A Lightweight CNN and Transformer Architecture for
Self-Supervised Monocular Depth Estimation
- Authors: Ning Zhang, Francesco Nex, George Vosselman, Norman Kerle
- Abstract summary: We investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture Lite-Mono.
A full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
- Score: 9.967643080731683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation that does not require ground-truth
for training has attracted attention in recent years. It is of high interest to
design lightweight but effective models, so that they can be deployed on edge
devices. Many existing architectures benefit from using heavier backbones at
the expense of model sizes. In this paper we achieve comparable results with a
lightweight architecture. Specifically, we investigate the efficient
combination of CNNs and Transformers, and design a hybrid architecture
Lite-Mono. A Consecutive Dilated Convolutions (CDC) module and a Local-Global
Features Interaction (LGFI) module are proposed. The former is used to extract
rich multi-scale local features, and the latter takes advantage of the
self-attention mechanism to encode long-range global information into the
features. Experiments demonstrate that our full model outperforms Monodepth2 by
a large margin in accuracy, with about 80% fewer trainable parameters.
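
The abstract says the CDC module extracts multi-scale local features with consecutive dilated convolutions. As a rough illustration of why stacking dilated convolutions helps, the sketch below computes the receptive field of a stack of stride-1 convolutions. The function name `stacked_receptive_field` and the dilation rates (1, 2, 3) are assumptions chosen for illustration; they are not taken from the paper.

```python
def stacked_receptive_field(kernel_size, dilations):
    """Receptive field along one axis of stride-1 convolutions applied in sequence.

    Each layer with dilation d widens the receptive field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Hypothetical dilation rates, for illustration only.
plain = stacked_receptive_field(3, [1, 1, 1])    # three ordinary 3x3 convolutions
dilated = stacked_receptive_field(3, [1, 2, 3])  # three 3x3 convolutions with growing dilation

print(plain, dilated)
```

Both stacks use the same number of 3x3 kernels, so under these assumptions the dilated stack covers a wider context without adding weights.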
Related papers
- CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference [33.871080938643566]
Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead.
We propose CMoE, a novel framework to efficiently carve MoE models from dense models.
CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation.
arXiv Detail & Related papers (2025-02-06T14:05:30Z)
- HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation [11.334990474402915]
We introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers.
HAFormer achieves high performance with minimal computational overhead and compact model size.
arXiv Detail & Related papers (2024-07-10T07:53:24Z) - Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - SideRT: A Real-time Pure Transformer Architecture for Single Image Depth
Estimation [11.513054537848227]
We propose a pure transformer architecture called SideRT that can attain excellent predictions in real-time.
This is the first work to show that transformer-based networks can attain state-of-the-art performance in real-time in the single image depth estimation field.
arXiv Detail & Related papers (2022-04-29T05:46:20Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for
Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z) - Real-time Monocular Depth Estimation with Sparse Supervision on Mobile [2.5425323889482336]
In recent years, with the increasing availability of mobile devices, accurate and mobile-friendly depth models have gained importance.
We show that, with key design choices and studies, even existing architectures can reach highly competitive performance.
A version of our model achieves 0.1208 WHDR on DIW with 1M parameters and reaches 44 FPS on a mobile GPU.
arXiv Detail & Related papers (2021-05-25T16:33:28Z) - A Compact Deep Architecture for Real-time Saliency Prediction [42.58396452892243]
Saliency models aim to imitate the attention mechanism in the human visual system.
Deep models have a high number of parameters, which makes them less suitable for real-time applications.
Here we propose a compact yet fast model for real-time saliency prediction.
arXiv Detail & Related papers (2020-08-30T17:47:16Z) - S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards exploiting dynamic structures that are capable of simultaneously exploiting both modular and temporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.