Related papers: Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

URL: http://arxiv.org/abs/2211.13202v1
Date: Wed, 23 Nov 2022 18:43:41 GMT
Title: Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
Authors: Ning Zhang, Francesco Nex, George Vosselman, Norman Kerle
Abstract summary: We investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture Lite-Mono. A full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
Score: 9.967643080731683
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-supervised monocular depth estimation that does not require ground-truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models, so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. In this paper we achieve comparable results with a lightweight architecture. Specifically, we investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture Lite-Mono. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that our full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.

Related papers

An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
Scaling Laws for Native Multimodal Models [53.490942903659565]
We revisit the architectural design of native multimodal models and conduct an extensive scaling laws study. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones. We show that incorporating Mixture of Experts (MoEs) allows for models that learn modality-specific weights, significantly enhancing performance.
arXiv Detail & Related papers (2025-04-10T17:57:28Z)
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism [67.56918651825056]
We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism. Our MI based model, MI-DETR, outperforms all existing DETR-like models on COCO benchmark. A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
arXiv Detail & Related papers (2025-03-03T12:19:06Z)
CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference [33.871080938643566]
Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead. We propose CMoE, a novel framework to efficiently carve MoE models from dense models. CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation.
arXiv Detail & Related papers (2025-02-06T14:05:30Z)
HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation [11.334990474402915]
We introduce HAFormer, a model that combines the hierarchical features extraction ability of CNNs with the global dependency modeling capability of Transformers. HAFormer achieves high performance with minimal computational overhead and compact model size.
arXiv Detail & Related papers (2024-07-10T07:53:24Z)
Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information. Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO) MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts. Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation [11.513054537848227]
We propose a pure transformer architecture called SideRT that can attain excellent predictions in real-time. This is the first work to show that transformer-based networks can attain state-of-the-art performance in real-time in the single image depth estimation field.
arXiv Detail & Related papers (2022-04-29T05:46:20Z)
DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation. We propose to leverage the Transformer to model this global context with an effective attention mechanism. Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
Real-time Monocular Depth Estimation with Sparse Supervision on Mobile [2.5425323889482336]
In recent years, with the increasing availability of mobile devices, accurate and mobile-friendly depth models have gained importance. We show, with key design choices and studies, even existing architecture can reach highly competitive performance. A version of our model achieves 0.1208 W on DIW with 1M parameters and reaches 44 FPS on a mobile GPU.
arXiv Detail & Related papers (2021-05-25T16:33:28Z)
A Compact Deep Architecture for Real-time Saliency Prediction [42.58396452892243]
Saliency models aim to imitate the attention mechanism in the human visual system. Deep models have a high number of parameters which makes them less suitable for real-time applications. Here we propose a compact yet fast model for real-time saliency prediction.
arXiv Detail & Related papers (2020-08-30T17:47:16Z)
S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards exploiting dynamic structure that are capable of simultaneously exploiting both modular andtemporal structures. We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)
Tidying Deep Saliency Prediction Architectures [6.613005108411055]
In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions. We propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state of the art performance on public saliency benchmarks.
arXiv Detail & Related papers (2020-03-10T19:34:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.