Real-time Monocular Depth Estimation with Sparse Supervision on Mobile
- URL: http://arxiv.org/abs/2105.12053v1
- Date: Tue, 25 May 2021 16:33:28 GMT
- Title: Real-time Monocular Depth Estimation with Sparse Supervision on Mobile
- Authors: Mehmet Kerim Yucel, Valia Dimaridou, Anastasios Drosou, Albert
Saà-Garriga
- Abstract summary: In recent years, with the increasing availability of mobile devices, accurate and mobile-friendly depth models have gained importance.
We show, with key design choices and ablation studies, that even an existing architecture can reach highly competitive performance.
A pruned version of our model achieves 0.1208 WHDR on DIW with 1M parameters and reaches 44 FPS on a mobile GPU.
- Score: 2.5425323889482336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular (relative or metric) depth estimation is a critical task for
various applications, such as autonomous vehicles, augmented reality and image
editing. In recent years, with the increasing availability of mobile devices,
accurate and mobile-friendly depth models have gained importance. Increasingly
accurate models typically require more computational resources, which inhibits
the use of such models on mobile devices. The mobile use case is arguably the
most unrestricted one, which requires highly accurate yet mobile-friendly
architectures. Therefore, we try to answer the following question: How can we
improve a model without adding further complexity (i.e. parameters)? Towards
this end, we systematically explore the design space of a relative depth
estimation model from various dimensions and we show, with key design choices
and ablation studies, that even an existing architecture can reach performance
highly competitive with the state of the art, with a fraction of the
complexity. Our study spans an in-depth backbone model selection process,
knowledge distillation, intermediate predictions, model pruning and loss
rebalancing. We show that our model, using only DIW as the supervisory dataset,
achieves 0.1156 WHDR on DIW with 2.6M parameters and reaches 37 FPS on a mobile
GPU, without pruning or hardware-specific optimization. A pruned version of our
model achieves 0.1208 WHDR on DIW with 1M parameters and reaches 44 FPS on a
mobile GPU.
Related papers
- TinyLLM: A Framework for Training and Deploying Language Models at the Edge Computers [0.8499685241219366]
Language models have gained significant interest due to their general-purpose capabilities, which appear to emerge as models are scaled to increasingly larger parameter sizes.
Large models impose stringent requirements on computing systems, needing significant memory and processing power for inference.
This makes performing inference on mobile and edge devices challenging, often requiring remotely-hosted models to be invoked via network calls.
arXiv Detail & Related papers (2024-12-19T12:28:27Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Mobile Foundation Model as Firmware [13.225478051091763]
The proposed system is a collaborative management approach between the mobile OS and hardware.
It amalgamates a curated selection of publicly available Large Language Models (LLMs) and facilitates dynamic data flow.
It attains accuracy parity in 85% of tasks, demonstrates improved scalability in terms of storage and memory, and offers satisfactory inference speed.
arXiv Detail & Related papers (2023-08-28T07:21:26Z) - Lite-Mono: A Lightweight CNN and Transformer Architecture for
Self-Supervised Monocular Depth Estimation [9.967643080731683]
We investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture Lite-Mono.
The full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
arXiv Detail & Related papers (2022-11-23T18:43:41Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor
Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while using far fewer trainable parameters and achieving high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computational cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - A Compact Deep Architecture for Real-time Saliency Prediction [42.58396452892243]
Saliency models aim to imitate the attention mechanism in the human visual system.
Deep models have a large number of parameters, which makes them less suitable for real-time applications.
Here we propose a compact yet fast model for real-time saliency prediction.
arXiv Detail & Related papers (2020-08-30T17:47:16Z) - Tidying Deep Saliency Prediction Architectures [6.613005108411055]
In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions.
We propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state of the art performance on public saliency benchmarks.
arXiv Detail & Related papers (2020-03-10T19:34:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.