EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model
- URL: http://arxiv.org/abs/2509.22527v1
- Date: Fri, 26 Sep 2025 16:05:43 GMT
- Title: EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model
- Authors: Andrii Litvynchuk, Ivan Livinsky, Anand Ravi, Nima Kalantari, Andrii Tsarov
- Abstract summary: We introduce a novel MDE system, called EfficientDepth, which combines a transformer architecture with a lightweight convolutional decoder. We train our model on a combination of labeled synthetic and real images, as well as pseudo-labeled real images, generated using a high-performing MDE method. In addition to commonly used objectives, we introduce a loss function based on LPIPS to encourage the network to produce detailed depth maps.
- Score: 1.4525559282354221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular depth estimation (MDE) plays a pivotal role in various computer vision applications, such as robotics, augmented reality, and autonomous driving. Despite recent advancements, existing methods often fail to meet key requirements for 3D reconstruction and view synthesis, including geometric consistency, fine details, robustness to real-world challenges like reflective surfaces, and efficiency for edge devices. To address these challenges, we introduce a novel MDE system, called EfficientDepth, which combines a transformer architecture with a lightweight convolutional decoder, as well as a bimodal density head that allows the network to estimate detailed depth maps. We train our model on a combination of labeled synthetic and real images, as well as pseudo-labeled real images, generated using a high-performing MDE method. Furthermore, we employ a multi-stage optimization strategy to improve training efficiency and produce models that emphasize geometric consistency and fine detail. Finally, in addition to commonly used objectives, we introduce a loss function based on LPIPS to encourage the network to produce detailed depth maps. Experimental results demonstrate that EfficientDepth achieves performance comparable to or better than existing state-of-the-art models, with significantly reduced computational resources.
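The abstract's bimodal density head is the detail-preserving ingredient: at a depth discontinuity, a unimodal (expectation) readout averages foreground and background depths into values that belong to neither surface, while a bimodal head can commit to one mode per pixel. The paper's exact parameterization is not given here, so the following NumPy sketch assumes a simple two-mode head (`mu1`, `mu2`, mixture weight `pi` per pixel) with a winner-take-all readout; all names are illustrative.

```python
import numpy as np

def bimodal_depth(mu1, mu2, pi):
    """Winner-take-all readout of a per-pixel two-mode depth density.

    mu1, mu2 : per-pixel locations of the two depth modes (metres)
    pi       : per-pixel mixture weight assigned to mu1, in [0, 1]

    Committing to the dominant mode keeps depth edges sharp.
    """
    return np.where(pi >= 0.5, mu1, mu2)

def unimodal_depth(mu1, mu2, pi):
    """Mixture mean: what a single-mode (expectation) head would output."""
    return pi * mu1 + (1.0 - pi) * mu2

# A row of pixels crossing a depth discontinuity: foreground at 1 m,
# background at 5 m. Near the edge the network is uncertain, so the
# mixture weight transitions smoothly from one mode to the other.
mu1 = np.full(7, 1.0)                                  # foreground mode
mu2 = np.full(7, 5.0)                                  # background mode
pi  = np.array([0.95, 0.9, 0.7, 0.5, 0.3, 0.1, 0.05])  # weight on mu1

print(bimodal_depth(mu1, mu2, pi))   # sharp edge: only 1.0 or 5.0 appear
print(unimodal_depth(mu1, mu2, pi))  # smeared edge: intermediate depths
```

The winner-take-all output contains only valid surface depths, whereas the expectation produces "flying pixels" between the two surfaces, which is exactly what hurts 3D reconstruction and view synthesis.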
Related papers
- Enhanced Mixture 3D CGAN for Completion and Generation of 3D Objects [0.2624902795082451]
The generation and completion of 3D objects represent a transformative challenge in computer vision.
In this paper, we investigate the integration of Deep 3D Convolutional GANs with a Mixture-of-Experts (MoE) framework to generate high-quality 3D models.
arXiv Detail & Related papers (2026-02-08T16:32:41Z)
- MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts [50.37005070020306]
MoRE is a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture.
It incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation.
It integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction.
arXiv Detail & Related papers (2025-10-31T06:54:27Z)
- ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation [44.75113949778924]
ARTDECO is a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines.
We show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization.
arXiv Detail & Related papers (2025-10-09T17:57:38Z)
- Optimized Unet with Attention Mechanism for Multi-Scale Semantic Segmentation [8.443350618722564]
This paper proposes an improved Unet model combined with an attention mechanism.
It introduces channel attention and spatial attention modules, enhancing the model's ability to focus on important features.
The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively.
arXiv Detail & Related papers (2025-02-06T06:51:23Z)
- Distillation of Diffusion Features for Semantic Correspondence [23.54555663670558]
We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency.
We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost.
Our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence.
arXiv Detail & Related papers (2024-12-04T17:55:33Z)
- MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection [9.780498146964097]
We propose an innovative network architecture, MonoMM, for real-time monocular 3D object detection.
MonoMM consists of Focused Multi-Scale Fusion (FMF) and Depth-Aware Feature Enhancement Mamba (DMB) modules.
Our method outperforms previous monocular methods and achieves real-time detection.
arXiv Detail & Related papers (2024-08-01T10:16:58Z)
- E2ENet: Dynamic Sparse Feature Fusion for Accurate and Efficient 3D Medical Image Segmentation [34.865695471451886]
We propose a 3D medical image segmentation model, named Efficient to Efficient Network (E2ENet).
It incorporates two parametrically and computationally efficient designs.
It consistently achieves a superior trade-off between accuracy and efficiency across various resource constraints.
arXiv Detail & Related papers (2023-12-07T22:13:37Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
However, they perform poorly on real-world data that is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding Network [63.69237156340457]
We present and advocate an explainable approach toward SISR named model-guided deep unfolding network (MoG-DUN).
MoG-DUN is accurate (producing fewer aliasing artifacts), computationally efficient (with reduced model parameters), and versatile (capable of handling multiple degradations).
The superiority of the proposed MoG-DUN method to existing state-of-the-art image super-resolution methods, including RCAN, SRDNF, and SRFBN, is substantiated by extensive experiments on several popular datasets and various degradation scenarios.
arXiv Detail & Related papers (2020-09-14T08:23:37Z)
- PaMIR: Parametric Model-Conditioned Implicit Representation for Image-based Human Reconstruction [67.08350202974434]
We propose Parametric Model-Conditioned Implicit Representation (PaMIR), which combines the parametric body model with the free-form deep implicit function.
We show that our method achieves state-of-the-art performance for image-based 3D human reconstruction in the cases of challenging poses and clothing types.
arXiv Detail & Related papers (2020-07-08T02:26:19Z)
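Several of the related papers above (MoRE, the Mixture 3D CGAN, and the compact hybrid network in the epipolar spatio-temporal work) rely on Mixture-of-Experts routing to keep capacity high at modest compute. None of those papers specify their gating here, so the following is only a generic NumPy sketch of softmax top-k gating with made-up shapes and weights:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, gate_w, expert_ws, top_k=1):
    """Minimal Mixture-of-Experts layer with top-k routing.

    x         : (batch, d_in) input features
    gate_w    : (d_in, n_experts) gating weights
    expert_ws : list of (d_in, d_out) linear expert weights
    Each input is sent only to its top_k experts; their outputs are
    combined with the gate probabilities renormalised over the top-k.
    """
    probs = softmax(x @ gate_w)                  # (batch, n_experts)
    top = np.argsort(probs, axis=1)[:, -top_k:]  # chosen expert indices
    out = np.zeros((x.shape[0], expert_ws[0].shape[1]))
    for b in range(x.shape[0]):
        chosen = probs[b, top[b]]
        chosen = chosen / chosen.sum()           # renormalise over top-k
        for w, e in zip(chosen, top[b]):
            out[b] += w * (x[b] @ expert_ws[e])
    return out, top

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 3))
experts = [rng.normal(size=(8, 5)) for _ in range(3)]
y, routed = moe_forward(x, gate_w, experts, top_k=1)
print(y.shape)   # each sample only ever touched one expert
```

The efficiency argument these papers make follows directly: with `top_k=1`, each input pays for a single expert's computation regardless of how many experts (and thus how much total capacity) the layer holds.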
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.