Attention meets Geometry: Geometry Guided Spatial-Temporal Attention for
Consistent Self-Supervised Monocular Depth Estimation
- URL: http://arxiv.org/abs/2110.08192v1
- Date: Fri, 15 Oct 2021 16:43:31 GMT
- Title: Attention meets Geometry: Geometry Guided Spatial-Temporal Attention for
Consistent Self-Supervised Monocular Depth Estimation
- Authors: Patrick Ruhkamp, Daoyi Gao, Hanzhi Chen, Nassir Navab, Benjamin Busam
- Abstract summary: This paper explores how the increasingly popular transformer architecture, together with novel regularized loss formulations, can improve depth consistency.
We propose a spatial attention module that correlates coarse depth predictions to aggregate local geometric information.
A novel temporal attention mechanism further processes the local geometric information in a global context across consecutive images.
- Score: 42.249533907879126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inferring geometrically consistent dense 3D scenes across a tuple of
temporally consecutive images remains challenging for self-supervised monocular
depth prediction pipelines. This paper explores how the increasingly popular
transformer architecture, together with novel regularized loss formulations,
can improve depth consistency while preserving accuracy. We propose a spatial
attention module that correlates coarse depth predictions to aggregate local
geometric information. A novel temporal attention mechanism further processes
the local geometric information in a global context across consecutive images.
Additionally, we introduce geometric constraints between frames regularized by
photometric cycle consistency. By combining our proposed regularization and the
novel spatial-temporal-attention module we fully leverage both the geometric
and appearance-based consistency across monocular frames. This yields
geometrically meaningful attention and improves temporal depth stability and
accuracy compared to previous methods.
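The abstract does not spell out how the temporal attention is computed. As a rough illustration only, the sketch below shows generic scaled dot-product attention in which tokens from one frame attend to tokens gathered from its neighbouring frames. It is not the authors' module: the function names, the toy feature vectors, and the choice to concatenate the previous and next frame as context are all assumptions made for this example, and it uses only the Python standard library to stay dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    queries: list of d-dimensional vectors (tokens of the target frame)
    keys, values: lists of vectors (tokens gathered from neighbouring frames)
    Returns the attended outputs and the per-query attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: two tokens per frame, two channels per token.
frame_t = [[1.0, 0.0], [0.0, 1.0]]
frame_prev = [[1.0, 0.0], [0.0, 1.0]]
frame_next = [[0.9, 0.1], [0.1, 0.9]]

# Assumed temporal context: tokens of the previous and next frame, concatenated.
context = frame_prev + frame_next
outputs, weights = attend(frame_t, context, context)
```

Each row of `weights` sums to one, so every output token is a convex combination of the context tokens. In a real pipeline the tokens would presumably be learned feature vectors rather than hand-written lists, and the projections for queries, keys, and values would be trained.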
Related papers
- Geometry-Aware Rotary Position Embedding for Consistent Video World Model [48.914346802616414]
ViewRope is a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. Geometry-Aware Frame-Sparse Attention exploits these geometric cues to selectively attend to relevant historical frames. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
arXiv Detail & Related papers (2026-02-08T08:01:16Z)
- GeoSurDepth: Spatial Geometry-Consistent Self-Supervised Depth Estimation for Surround-View Cameras [3.072321170197384]
GeoSurDepth is a framework that leverages geometry consistency as the primary cue for surround-view depth estimation. Our framework highlights the importance of exploiting geometry coherence and consistency for robust self-supervised multi-view depth estimation.
arXiv Detail & Related papers (2026-01-09T15:13:28Z)
- Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance [61.41904916189093]
We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images. We use hand-object interaction as geometric guidance to ensure plausible hand-object interactions.
arXiv Detail & Related papers (2025-08-25T17:11:53Z)
- Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation [62.87088388345378]
We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. The method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images. Cross-modal attention instillation is proposed to ensure accurate alignment between generated images and geometry.
arXiv Detail & Related papers (2025-06-13T16:19:00Z)
- Geometry-Editable and Appearance-Preserving Object Compositon [67.98806888489385]
General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion model that first leverages semantic embeddings to implicitly capture desired geometric transformations.
arXiv Detail & Related papers (2025-05-27T09:05:28Z)
- Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection [62.27919334393825]
We propose a novel Geometry-aware Temporal Aggregation Network (GTA-Net) for monocular 3D lane detection.
On one hand, we develop the Temporal Geometry Enhancement Module (TGEM), which exploits geometric consistency across successive frames.
On the other hand, we present the Temporal Instance-aware Query Generation (TIQG), which strategically incorporates temporal cues into query generation.
arXiv Detail & Related papers (2025-04-29T08:10:17Z)
- RDG-GS: Relative Depth Guidance with Gaussian Splatting for Real-time Sparse-View 3D Rendering [13.684624443214599]
We present RDG-GS, a novel sparse-view 3D rendering framework with Relative Depth Guidance based on 3D Gaussian Splatting.
The core innovation lies in utilizing relative depth guidance to refine the Gaussian field, steering it towards view-consistent spatial geometric representations.
Across extensive experiments on Mip-NeRF360, LLFF, DTU, and Blender, RDG-GS demonstrates state-of-the-art rendering quality and efficiency.
arXiv Detail & Related papers (2025-01-19T16:22:28Z)
- Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction [61.484280369655536]
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations.
Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning.
We introduce a new hierarchical context alignment paradigm for more accurate SOP (Hi-SOP).
arXiv Detail & Related papers (2024-12-11T09:53:10Z)
- Geometric Point Attention Transformer for 3D Shape Reassembly [17.34739330880715]
We present a network specifically designed to address the challenges of reasoning about geometric relationships.
We integrate both global shape information and local pairwise geometric features, along with poses represented as rotation and translation vectors for each part.
We evaluate our model on both the semantic and geometric assembly tasks, showing that it outperforms previous methods in absolute pose estimation.
arXiv Detail & Related papers (2024-11-26T15:29:38Z)
- ND-SDF: Learning Normal Deflection Fields for High-Fidelity Indoor Reconstruction [50.07671826433922]
It is non-trivial to simultaneously recover meticulous geometry and preserve smoothness across regions with differing characteristics.
We propose ND-SDF, which learns a Normal Deflection field to represent the angular deviation between the scene normal and the prior normal.
Our method not only obtains smooth weakly textured regions such as walls and floors but also preserves the geometric details of complex structures.
arXiv Detail & Related papers (2024-08-22T17:59:01Z)
- DoubleTake: Geometry Guided Depth Estimation [17.464549832122714]
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task.
We introduce a reconstruction which combines volume features with a hint of the prior geometry, rendered as a depth map from the current camera location.
We demonstrate that our method can run at interactive speeds and achieves state-of-the-art depth and 3D scene estimates in both offline and incremental evaluation scenarios.
arXiv Detail & Related papers (2024-06-26T14:29:05Z)
- DCPI-Depth: Explicitly Infusing Dense Correspondence Prior to Unsupervised Monocular Depth Estimation [17.99904937160487]
DCPI-Depth is a framework that incorporates all these innovative components and couples two bidirectional and collaborative streams.
It achieves state-of-the-art performance and generalizability across multiple public datasets, outperforming existing prior art.
arXiv Detail & Related papers (2024-05-27T08:55:17Z)
- SGFormer: Spherical Geometry Transformer for 360 Depth Estimation [54.13459226728249]
Panoramic distortion poses a significant challenge in 360 depth estimation.
We propose a spherical geometry transformer, named SGFormer, to address the above issues.
We also present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions.
arXiv Detail & Related papers (2024-04-23T12:36:24Z)
- GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z)
- Adaptive Surface Normal Constraint for Geometric Estimation from Monocular Images [56.86175251327466]
We introduce a novel approach to learn geometries such as depth and surface normal from images while incorporating geometric context.
Our approach extracts geometric context that encodes the geometric variations present in the input image and correlates depth estimation with geometric constraints.
Our method unifies depth and surface normal estimations within a cohesive framework, which enables the generation of high-quality 3D geometry from images.
arXiv Detail & Related papers (2024-02-08T17:57:59Z)
- Learning Monocular Depth in Dynamic Environment via Context-aware Temporal Attention [9.837958401514141]
We present CTA-Depth, a Context-aware Temporal Attention guided network for multi-frame monocular Depth estimation.
Our approach achieves significant improvements over state-of-the-art approaches on three benchmark datasets.
arXiv Detail & Related papers (2023-05-12T11:48:32Z)
- Few-shot Non-line-of-sight Imaging with Signal-surface Collaborative Regularization [18.466941045530408]
Non-line-of-sight imaging aims to reconstruct targets from multiply reflected light.
We propose a signal-surface collaborative regularization framework that provides noise-robust reconstructions with a minimal number of measurements.
Our approach has great potential in real-time non-line-of-sight imaging applications such as rescue operations and autonomous driving.
arXiv Detail & Related papers (2022-11-21T11:19:20Z)
- A Unifying and Canonical Description of Measure-Preserving Diffusions [60.59592461429012]
A complete recipe of measure-preserving diffusions in Euclidean space was recently derived unifying several MCMC algorithms into a single framework.
We develop a geometric theory that improves and generalises this construction to any manifold.
arXiv Detail & Related papers (2021-05-06T17:36:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.