Monocular Depth Estimation with Global-Aware Discretization and Local Context Modeling
- URL: http://arxiv.org/abs/2508.03186v1
- Date: Tue, 05 Aug 2025 07:51:37 GMT
- Title: Monocular Depth Estimation with Global-Aware Discretization and Local Context Modeling
- Authors: Heng Wu, Qian Zhang, Guixu Zhang
- Abstract summary: We present a novel depth estimation method that combines both local and global cues to improve prediction accuracy. Specifically, we propose the Gated Large Kernel Attention Module (GLKAM) to effectively capture multi-scale local structural information. To further enhance the global perception of the network, we introduce the Global Bin Prediction Module (GBPM).
- Score: 15.556824810217073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate monocular depth estimation remains a challenging problem due to the inherent ambiguity that stems from the ill-posed nature of recovering 3D structure from a single view, where multiple plausible depth configurations can produce identical 2D projections. In this paper, we present a novel depth estimation method that combines both local and global cues to improve prediction accuracy. Specifically, we propose the Gated Large Kernel Attention Module (GLKAM) to effectively capture multi-scale local structural information by leveraging large kernel convolutions with a gated mechanism. To further enhance the global perception of the network, we introduce the Global Bin Prediction Module (GBPM), which estimates the global distribution of depth bins and provides structural guidance for depth regression. Extensive experiments on the NYU-V2 and KITTI datasets demonstrate that our method achieves competitive performance and outperforms existing approaches, validating the effectiveness of each proposed component.
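The abstract names the two modules and what they are for, but not how they are built. The sketch below is therefore a minimal, assumed reading in PyTorch: GLKAM is modeled after decomposed large-kernel attention (a depthwise convolution followed by a dilated depthwise convolution and a pointwise mix) with a learned sigmoid gate, and GBPM after adaptive-bin depth regression, where a softmax over globally predicted bin widths yields bin centers and depth is their probability-weighted sum. All layer choices and hyperparameters are illustrative, not the authors' implementation.

```python
# Hedged sketch only: module names come from the abstract; internals are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLKAM(nn.Module):
    """Gated Large Kernel Attention Module (assumed structure)."""
    def __init__(self, dim: int):
        super().__init__()
        # Decomposed large kernel: 5x5 depthwise, then 7x7 depthwise with
        # dilation 3 (~19x19 effective receptive field), then 1x1 pointwise.
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)
        self.gate = nn.Conv2d(dim, dim, 1)  # learned gating branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn * torch.sigmoid(self.gate(x))  # gated attention

class GBPM(nn.Module):
    """Global Bin Prediction Module (assumed, AdaBins-style regression)."""
    def __init__(self, dim: int, n_bins: int = 64,
                 d_min: float = 1e-3, d_max: float = 10.0):
        super().__init__()
        self.d_min, self.d_max = d_min, d_max
        self.bin_head = nn.Linear(dim, n_bins)      # global bin-width logits
        self.prob_head = nn.Conv2d(dim, n_bins, 1)  # per-pixel bin probabilities

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Predict a global distribution of bin widths from pooled features.
        pooled = feat.mean(dim=(2, 3))                     # (B, C)
        widths = F.softmax(self.bin_head(pooled), dim=1)   # (B, K), sums to 1
        edges = self.d_min + (self.d_max - self.d_min) * torch.cumsum(widths, dim=1)
        centers = edges - 0.5 * (self.d_max - self.d_min) * widths  # (B, K)
        # Depth = probability-weighted sum of bin centers at every pixel.
        probs = F.softmax(self.prob_head(feat), dim=1)     # (B, K, H, W)
        return (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)
```

Read this way, GLKAM buys a large receptive field for local structure at depthwise-convolution cost rather than full self-attention, while GBPM turns depth regression into a soft classification over globally predicted bins, which is one way a predicted bin distribution can guide the regression.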
Related papers
- Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion [33.854696587141355]
We propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. Our framework performs remarkably well in OOD scenarios and outperforms existing state-of-the-art depth completion methods.
arXiv Detail & Related papers (2025-08-07T02:38:24Z)
- Double-Shot 3D Shape Measurement with a Dual-Branch Network for Structured Light Projection Profilometry [14.749887303860717]
We propose a dual-branch Convolutional Neural Network (CNN)-Transformer network (PDCNet) to process different structured light (SL) modalities. Within PDCNet, a Transformer branch is used to capture global perception in the fringe images, while a CNN branch is designed to collect local details in the speckle images. Our method can reduce fringe order ambiguity while producing high-accuracy results on self-made datasets.
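The summary fixes only the division of labor (a Transformer branch on fringe images for global perception, a CNN branch on speckle images for local detail), so the following is a hedged sketch of that dual-branch layout; the layer sizes, patch embedding, and concatenation fusion are assumptions, not PDCNet's actual design.

```python
# Hedged sketch of a dual-branch CNN/Transformer layout; internals assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchNet(nn.Module):
    def __init__(self, dim: int = 64, patch: int = 8):
        super().__init__()
        # Transformer branch: patchify the fringe image, then self-attention.
        self.patch_embed = nn.Conv2d(1, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # CNN branch: plain convolutions on the speckle image for local detail.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(2 * dim, 1, 1)  # fuse branches and regress

    def forward(self, fringe: torch.Tensor, speckle: torch.Tensor) -> torch.Tensor:
        b, _, h, w = fringe.shape
        tokens = self.patch_embed(fringe)              # (B, C, h', w')
        hp, wp = tokens.shape[2], tokens.shape[3]
        tokens = self.transformer(tokens.flatten(2).transpose(1, 2))
        glob = tokens.transpose(1, 2).reshape(b, -1, hp, wp)
        glob = F.interpolate(glob, size=(h, w), mode="bilinear",
                             align_corners=False)      # back to input size
        local = self.cnn(speckle)
        return self.head(torch.cat([glob, local], dim=1))
```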
arXiv Detail & Related papers (2024-07-19T10:49:26Z)
- Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D Gaussian [49.21866794516328]
3D Gaussian splatting has demonstrated impressive performance in real-time novel view synthesis.
Previous approaches have incorporated depth supervision into the training of 3D Gaussians to mitigate overfitting.
We introduce a novel method to supervise the depth distribution of 3D Gaussians, utilizing depth priors with integrated uncertainty estimates.
arXiv Detail & Related papers (2024-05-30T03:18:30Z)
- Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network [80.19054069988559]
We find that self-supervised monocular depth estimation exhibits direction sensitivity and environmental dependency.
We propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth representation in two aspects.
Experiments show that our method achieves significant improvements on three widely used benchmarks.
arXiv Detail & Related papers (2023-08-10T14:32:18Z)
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on an indoor NYUV2 dataset and an outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state of the art by 3.3% on NYUV2 and 3.3% on KITTI in terms of Root Mean Squared Error (RMSE).
arXiv Detail & Related papers (2022-07-10T20:49:11Z)
- Uniform Manifold Approximation with Two-phase Optimization [13.229510087215552]
We introduce Uniform Manifold Approximation with Two-phase Optimization (UMATO), a dimensionality reduction (DR) technique that improves UMAP to capture the global structure of high-dimensional data more accurately.
arXiv Detail & Related papers (2022-05-01T08:19:52Z)
- Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO).
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth [90.33296913575818]
In video-based scenarios such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift residing in per-frame predictions may cause depth inconsistency.
We propose a locally weighted linear regression method to recover the scale and shift with very sparse anchor points (a hedged sketch follows below).
Our method can boost the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks.
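The summary describes the mechanism concretely enough to sketch: fit a per-location scale s and shift t so that s * d_pred + t matches the metric depth at nearby anchors, with weights that decay with distance from the query pixel. The Gaussian weighting and the closed-form weighted-least-squares solver below are assumptions; the paper's exact scheme is not given in the summary.

```python
# Hedged sketch of locally weighted scale-and-shift recovery from sparse anchors.
import numpy as np

def recover_scale_shift(pred, anchor_xy, anchor_depth, query_xy, sigma=32.0):
    """Solve depth = s * pred + t locally around each query pixel.

    pred:         (H, W) relative depth prediction
    anchor_xy:    (N, 2) pixel coordinates (x, y) of sparse anchor points
    anchor_depth: (N,) metric depth at the anchors
    query_xy:     (M, 2) pixels at which to estimate local (s, t)
    """
    d_pred = pred[anchor_xy[:, 1], anchor_xy[:, 0]]        # (N,) predicted depth at anchors
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)   # (N, 2) design matrix
    params = []
    for q in query_xy:
        # Anchors near the query pixel dominate the fit (locality).
        w = np.exp(-np.sum((anchor_xy - q) ** 2, axis=1) / (2 * sigma**2))
        W = np.diag(w)
        # Weighted least squares: (A^T W A) [s, t]^T = A^T W d_anchor
        s, t = np.linalg.solve(A.T @ W @ A, A.T @ W @ anchor_depth)
        params.append((s, t))
    return np.array(params)                                 # (M, 2) of (s, t)
```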
arXiv Detail & Related papers (2022-02-03T08:52:54Z)
- Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth [24.897377434844266]
We propose a novel structure and training strategy for monocular depth estimation.
We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder.
Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2.
arXiv Detail & Related papers (2022-01-19T06:37:21Z)
- Bidirectional Attention Network for Monocular Depth Estimation [18.381967717929264]
Bidirectional Attention Network (BANet) is an end-to-end framework for monocular depth estimation (MDE).
We introduce bidirectional attention modules that utilize the feed-forward feature maps and incorporate the global context to filter out ambiguity.
We show that our proposed approach either outperforms or performs at least on a par with the state-of-the-art monocular depth estimation methods with less memory and computational complexity.
arXiv Detail & Related papers (2020-09-01T23:14:05Z)
- Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction [72.30870535815258]
Monocular SLAM and CNNs for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment.
We propose a joint narrow and wide baseline based self-improving framework, where on the one hand the CNN-predicted depth is leveraged to perform pseudo RGB-D feature-based SLAM.
On the other hand, the bundle-adjusted 3D scene structures and camera poses from the more principled geometric SLAM are injected back into the depth network through novel wide baseline losses.
arXiv Detail & Related papers (2020-04-22T16:31:59Z)
- OmniSLAM: Omnidirectional Localization and Dense Mapping for Wide-baseline Multi-camera Systems [88.41004332322788]
We present an omnidirectional localization and dense mapping system for a wide-baseline multiview stereo setup with ultra-wide field-of-view (FOV) fisheye cameras.
For more practical and accurate reconstruction, we first introduce improved and lightweight deep neural networks for omnidirectional depth estimation.
We integrate our omnidirectional depth estimates into the visual odometry (VO) and add a loop closing module for global consistency.
arXiv Detail & Related papers (2020-03-18T05:52:10Z)