VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery
- URL: http://arxiv.org/abs/2505.02704v2
- Date: Tue, 06 May 2025 03:06:28 GMT
- Title: VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery
- Authors: Bojin Wu, Jing Chen,
- Abstract summary: VGLD stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the description.<n>This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map.<n>Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios.
- Score: 2.8834278113855896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a robust method for monocular depth scale recovery. Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which involves recovering depth with absolute scale. To obtain absolute scale information for practical downstream tasks, utilizing textual information to recover the scale of a relative depth map is a highly promising approach. However, since a single image can have multiple descriptions from different perspectives or with varying styles, it has been shown that different textual descriptions can significantly affect the scale recovery process. To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description. This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map, ultimately generating depth predictions with metric-scale accuracy. We validate our method across several popular relative depth models(MiDas, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI). Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios. Code is available at: https://github.com/pakinwu/VGLD.
Related papers
- TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast [7.127920563966129]
Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE)<n>MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downstream applications.<n>Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level.
arXiv Detail & Related papers (2025-06-16T11:50:00Z) - Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image [51.689871870692194]
Metric-r is a novel sliding anchor-based metric depth estimation method.<n>Our design enables a unified and adaptive depth representation across diverse environments.
arXiv Detail & Related papers (2025-04-16T14:12:25Z) - RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions [47.614203035800735]
Inferring depth from a single image is an ill-posed problem due to the loss of scale from perspective projection.
Our goal is to recover metric-scaled depth maps through a linear transformation.
Our method, RSA, takes as input a text caption describing objects present in an image and outputs the parameters of a linear transformation.
arXiv Detail & Related papers (2024-10-03T19:18:13Z) - TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs [5.6168844664788855]
This work presents TanDepth, a practical scale recovery method for obtaining metric depth results from relative estimations at inference-time.<n>Our method leverages sparse measurements from Global Digital Elevation Models (GDEM) by projecting them to the camera view.<n>An adaptation to the Cloth Filter Simulation is presented, which allows selecting ground points from the estimated depth map to then correlate with the projected reference points.
arXiv Detail & Related papers (2024-09-08T15:54:43Z) - ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module.
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z) - DVMNet++: Rethinking Relative Pose Estimation for Unseen Objects [59.51874686414509]
Existing approaches typically predict 3D translation utilizing the ground-truth object bounding box and approximate 3D rotation with a large number of discrete hypotheses.<n>We present a Deep Voxel Matching Network (DVMNet++) that computes the relative object pose in a single pass.<n>Our approach delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-03-20T15:41:32Z) - Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation [31.34615135846137]
We propose a few-shot-based method which learns to adapt the Vision-Language Models for monocular depth estimation.
Specifically, it assigns different depth bins for different scenes, which can be selected by the model during inference.
With only one image per scene for training, our extensive experiment results on the NYU V2 and KITTI dataset demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
arXiv Detail & Related papers (2023-11-02T06:56:50Z) - Monocular Visual-Inertial Depth Estimation [66.71452943981558]
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry.
Our approach performs global scale and shift alignment against sparse metric depth, followed by learning-based dense alignment.
We evaluate on the TartanAir and VOID datasets, observing up to 30% reduction in RMSE with dense scale alignment.
arXiv Detail & Related papers (2023-03-21T18:47:34Z) - Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO)
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z) - DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data [110.29043712400912]
We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation.
Experiments show that our method outperforms previous methods on 8 datasets by a large margin with the zero-shot test setting.
arXiv Detail & Related papers (2020-02-03T05:38:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.