Related papers: CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

URL: http://arxiv.org/abs/2511.16428v1
Date: Thu, 20 Nov 2025 14:55:28 GMT
Title: CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
Authors: Samer Abualhanud, Christian Grannemann, Max Mehltretter,
Abstract summary: Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360 field of view from multiple minimally overlapping images.<n>Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images.<n>We propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth.
Score: 0.9558392439655014
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.

Related papers

FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers [91.59069344768858]
We introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder.<n>FreqPDE combines the 2D image features and 3D position embeddings to generate 3D depth-aware features for query decoding.
arXiv Detail & Related papers (2025-10-17T07:36:54Z)
Refinement of Monocular Depth Maps via Multi-View Differentiable Rendering [6.372979654151044]
Current state-of-the-art monocular depth estimators, trained on extensive datasets, generalize well but lack 3D consistency needed for many applications.<n>In this paper, we combine the strength of those generalizing monocular depth estimation techniques with multi-view data by framing this as an analysis-by-synthesis optimization problem.<n>Our method is able to generate detailed, high-quality, view consistent, accurate depth maps, also in challenging indoor scenarios, and outperforms state-of-the-art multi-view depth reconstruction approaches on such datasets.
arXiv Detail & Related papers (2024-10-04T18:50:28Z)
Robust and Flexible Omnidirectional Depth Estimation with Multiple 360-degree Cameras [8.850391039025077]
We use geometric constraints and redundant information of multiple 360-degree cameras to achieve robust and flexible omnidirectional depth estimation.<n>Our two algorithms achieve state-of-the-art performance, accurately predicting depth maps even when provided with soiled panorama inputs.
arXiv Detail & Related papers (2024-09-23T07:31:48Z)
Calibrating Panoramic Depth Estimation for Practical Localization and Mapping [20.621442016969976]
The absolute depth values of surrounding environments provide crucial cues for various assistive technologies, such as localization, navigation, and 3D structure estimation. We propose that accurate depth estimated from panoramic images can serve as a powerful and light-weight input for a wide range of downstream tasks requiring 3D information.
arXiv Detail & Related papers (2023-08-27T04:50:05Z)
Understanding Depth Map Progressively: Adaptive Distance Interval Separation for Monocular 3d Object Detection [38.96129204108353]
Several monocular 3D detection techniques rely on auxiliary depth maps from the depth estimation task. We propose a framework named the Adaptive Distance Interval Separation Network (ADISN) that adopts a novel perspective on understanding depth maps.
arXiv Detail & Related papers (2023-06-19T13:32:53Z)
Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation [75.99435808648784]
We propose a novel multi-camera collaborative depth prediction method. It does not require large overlapping areas while maintaining structure consistency between cameras. Experimental results on DDAD and NuScenes datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2022-10-05T03:44:34Z)
GeoFill: Reference-Based Image Inpainting of Scenes with Complex Geometry [40.68659515139644]
Reference-guided image inpainting restores image pixels by leveraging the content from another reference image. We leverage a monocular depth estimate and predict relative pose between cameras, then align the reference image to the target by a differentiable 3D reprojection. Our approach achieves state-of-the-art performance on both RealEstate10K and MannequinChallenge dataset with large baselines, complex geometry and extreme camera motions.
arXiv Detail & Related papers (2022-01-20T12:17:13Z)
Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views. We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
Occlusion-Aware Depth Estimation with Adaptive Normal Constraints [85.44842683936471]
We present a new learning-based method for multi-frame depth estimation from a color video. Our method outperforms the state-of-the-art in terms of depth estimation accuracy.
arXiv Detail & Related papers (2020-04-02T07:10:45Z)
Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images [59.906948203578544]
We introduce a novel learning-based method to reconstruct the high-quality geometry and complex, spatially-varying BRDF of an arbitrary object. We first estimate per-view depth maps using a deep multi-view stereo network. These depth maps are used to coarsely align the different views. We propose a novel multi-view reflectance estimation network architecture.
arXiv Detail & Related papers (2020-03-27T21:28:54Z)
Depth Completion Using a View-constrained Deep Prior [73.21559000917554]
Recent work has shown that the structure of convolutional neural networks (CNNs) induces a strong prior that favors natural images. This prior, known as a deep image prior (DIP), is an effective regularizer in inverse problems such as image denoising and inpainting. We extend the concept of the DIP to depth images. Given color images and noisy and incomplete target depth maps, we reconstruct a depth map restored by virtue of using the CNN network structure as a prior.
arXiv Detail & Related papers (2020-01-21T21:56:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.