Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer
- URL: http://arxiv.org/abs/2406.08928v1
- Date: Thu, 13 Jun 2024 08:51:57 GMT
- Title: Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer
- Authors: Guodong Sun, Junjie Liu, Mingxuan Liu, Moyun Liu, Yang Zhang
- Abstract summary: Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data.
The lack of labeled information poses a significant challenge to the model's representation, limiting its ability to accurately capture the intricate details of the scene.
We introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions.
- Score: 12.486504395099022
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model's understanding of scene structure and texture. Nevertheless, solely relying on a single type of prior information often falls short when dealing with complex scenes, necessitating improvements in generalization performance. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors in the spatial dimension. Then, the context prior attention is designed to improve generalization, particularly in complex structures or untextured areas. In addition, semantic priors are introduced by leveraging semantic boundary loss, and semantic prior attention is supplemented, further refining the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model. It integrates multiple priors to comprehensively enhance the representation ability, improving the accuracy and reliability of depth estimation. Codes are available at: \url{https://github.com/MVME-HBUT/MPRLNet}
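As a concrete point of reference for how such models train without labels, the sketch below shows the standard photometric reprojection objective (an SSIM + L1 mix, as popularized by Monodepth2-style pipelines) that self-supervised monocular depth methods build on. This is a generic, minimal illustration, not the paper's full loss, which additionally uses the prior-based terms described above (e.g., the semantic boundary loss).

```python
import torch
import torch.nn.functional as F

def ssim_dissimilarity(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Single-scale SSIM computed with 3x3 average pooling; returns the
    # per-pixel structural dissimilarity (1 - SSIM) / 2 in [0, 1].
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(warped, target, alpha=0.85):
    # Standard weighted mix of structural dissimilarity and L1 error.
    l1 = (warped - target).abs()
    return (alpha * ssim_dissimilarity(warped, target) + (1 - alpha) * l1).mean()

# The "warped" image is the source frame reprojected into the target
# view using the predicted depth and relative pose; random tensors
# stand in for it here.
target = torch.rand(1, 3, 192, 640)
warped = torch.rand(1, 3, 192, 640)
loss = photometric_loss(warped, target)
```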
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract priors from transformers well-trained on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification.
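For illustration, here is a minimal sketch of cross-modal prior distillation in this spirit: a 3D point-cloud student is trained to match features and temperature-scaled logits from a frozen 2D transformer teacher. The tensor shapes, loss terms, and their equal weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prior_distillation_loss(student_feat, teacher_feat,
                            student_logits, teacher_logits, tau=4.0):
    # Feature imitation: pull the 3D student's embedding toward the
    # frozen 2D teacher's embedding.
    feat_loss = F.mse_loss(student_feat, teacher_feat)
    # Soft-label distillation: KL between temperature-scaled logits.
    kd_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                       F.softmax(teacher_logits / tau, dim=1),
                       reduction="batchmean") * tau ** 2
    return feat_loss + kd_loss

# Stand-in tensors: 8 samples, 256-d embeddings, 10 classes.
loss = prior_distillation_loss(torch.randn(8, 256), torch.randn(8, 256),
                               torch.randn(8, 10), torch.randn(8, 10))
```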
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z)
- Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering [93.94371335579321]
We propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations.
Comprehensive experiments underscore our framework's superior generalization capabilities.
Our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients.
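To make the scale-and-shift relation concrete, the sketch below shows the common closed-form least-squares alignment between a scale-ambiguous depth prediction and a reference depth. This illustrates the quantity being recovered; it is not necessarily the paper's training procedure, which recovers the coefficients through its loss functions.

```python
import torch

def fit_scale_and_shift(d, g):
    # Closed-form solution of the 2x2 normal equations for
    # argmin_{s,t} ||s * d + t - g||^2.
    a00 = (d * d).sum()
    a01 = d.sum()
    a11 = d.new_tensor(float(d.numel()))
    b0 = (d * g).sum()
    b1 = g.sum()
    det = a00 * a11 - a01 * a01
    s = (a11 * b0 - a01 * b1) / det
    t = (a00 * b1 - a01 * b0) / det
    return s, t

pred = torch.rand(192 * 640)           # scale-ambiguous depth prediction
ref = 3.0 * pred + 0.5                 # reference with known scale/shift
s, t = fit_scale_and_shift(pred, ref)  # recovers s ~ 3.0, t ~ 0.5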
arXiv Detail & Related papers (2023-09-18T12:36:39Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in deep semantic segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
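A minimal sketch of the general idea of fusing local CNN features with global transformer features follows. The paper's mask-guided multi-stream convolution is simplified here to a learned per-pixel gate, so the module name and structure are illustrative assumptions, not the exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFusion(nn.Module):
    # Illustrative fusion of a local CNN stream and a global
    # transformer stream; not the paper's exact architecture.
    def __init__(self, c_cnn, c_transformer, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_cnn + c_transformer, c_out, 3, padding=1)
        self.gate = nn.Sequential(
            nn.Conv2d(c_cnn + c_transformer, c_out, 1), nn.Sigmoid())

    def forward(self, f_cnn, f_transformer):
        # Upsample the coarse transformer features to the CNN stream's
        # resolution, then blend both streams with a learned spatial gate.
        f_transformer = F.interpolate(f_transformer, size=f_cnn.shape[-2:],
                                      mode="bilinear", align_corners=False)
        x = torch.cat([f_cnn, f_transformer], dim=1)
        return self.gate(x) * self.proj(x)

fuse = HybridFusion(c_cnn=64, c_transformer=128, c_out=64)
out = fuse(torch.randn(1, 64, 48, 160), torch.randn(1, 128, 12, 40))
```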
arXiv Detail & Related papers (2022-11-20T20:00:21Z)
- Depth Field Networks for Generalizable Multi-view Scene Representation [31.090289865520475]
We learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity.
Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.
arXiv Detail & Related papers (2022-07-28T17:59:31Z)
- BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation [46.678016537618845]
We present a novel framework called BinsFormer, tailored for the classification-regression-based depth estimation.
It mainly focuses on two crucial components of this task: 1) proper generation of adaptive bins and 2) sufficient interaction between the probability distribution and the bin predictions.
Experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that BinsFormer surpasses state-of-the-art monocular depth estimation methods with prominent margins.
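The core classification-regression formulation behind adaptive-bins methods is compact enough to sketch: the network predicts a set of depth-bin centers per image plus a per-pixel probability distribution over those bins, and the final depth is their linear combination. The uniform bin centers below are a stand-in; adaptive methods predict them from the image.

```python
import torch

def bins_to_depth(bin_centers, logits):
    # bin_centers: (B, N) depth-bin centers for each image
    # logits:      (B, N, H, W) per-pixel scores over the N bins
    probs = torch.softmax(logits, dim=1)                      # classification
    return torch.einsum("bn,bnhw->bhw", bin_centers, probs)  # regression

centers = torch.linspace(0.5, 80.0, 64).unsqueeze(0)  # (1, 64), in meters
logits = torch.randn(1, 64, 96, 320)
depth = bins_to_depth(centers, logits)                # (1, 96, 320)
```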
arXiv Detail & Related papers (2022-04-03T04:38:02Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
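The sketch below illustrates the underlying mechanism: flatten the feature map into tokens and apply multi-head self-attention so every spatial location can attend to every other. This is generic attention for long-range correlation, not DepthFormer's exact architecture.

```python
import torch
import torch.nn as nn

# A single multi-head self-attention layer over the flattened feature
# map: every spatial location attends to every other location.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

feat = torch.randn(1, 256, 24, 80)              # (B, C, H, W) features
tokens = feat.flatten(2).transpose(1, 2)        # (B, H*W, C) tokens
global_ctx, _ = attn(tokens, tokens, tokens)    # all-pairs attention
feat_global = global_ctx.transpose(1, 2).reshape(feat.shape)
```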
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.