VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery
- URL: http://arxiv.org/abs/2505.02704v3
- Date: Sun, 13 Jul 2025 06:16:51 GMT
- Title: VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery
- Authors: Bojin Wu, Jing Chen,
- Abstract summary: VGLD (Visually-Guided Linguistic Disambiguation) is a framework that incorporates high-level visual semantics to resolve ambiguity in textual inputs.<n>By jointly encoding both image and text, VGLD predicts a set of global linear transformation parameters that align relative depth maps with metric scale.<n>Results show that VGLD significantly mitigates scale estimation bias caused by inconsistent or ambiguous language, achieving robust and accurate metric predictions.
- Score: 2.8834278113855896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular depth estimation can be broadly categorized into two directions: relative depth estimation, which predicts normalized or inverse depth without absolute scale, and metric depth estimation, which aims to recover depth with real-world scale. While relative methods are flexible and data-efficient, their lack of metric scale limits their utility in downstream tasks. A promising solution is to infer absolute scale from textual descriptions. However, such language-based recovery is highly sensitive to natural language ambiguity, as the same image may be described differently across perspectives and styles. To address this, we introduce VGLD (Visually-Guided Linguistic Disambiguation), a framework that incorporates high-level visual semantics to resolve ambiguity in textual inputs. By jointly encoding both image and text, VGLD predicts a set of global linear transformation parameters that align relative depth maps with metric scale. This visually grounded disambiguation improves the stability and accuracy of scale estimation. We evaluate VGLD on representative models, including MiDaS and DepthAnything, using standard indoor (NYUv2) and outdoor (KITTI) benchmarks. Results show that VGLD significantly mitigates scale estimation bias caused by inconsistent or ambiguous language, achieving robust and accurate metric predictions. Moreover, when trained on multiple datasets, VGLD functions as a universal and lightweight alignment module, maintaining strong performance even in zero-shot settings. Code will be released upon acceptance.
Related papers
- TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast [7.127920563966129]
Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE)<n>MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downstream applications.<n>Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level.
arXiv Detail & Related papers (2025-06-16T11:50:00Z) - Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image [51.689871870692194]
Metric-r is a novel sliding anchor-based metric depth estimation method.<n>Our design enables a unified and adaptive depth representation across diverse environments.
arXiv Detail & Related papers (2025-04-16T14:12:25Z) - RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions [47.614203035800735]
Inferring depth from a single image is an ill-posed problem due to the loss of scale from perspective projection.
Our goal is to recover metric-scaled depth maps through a linear transformation.
Our method, RSA, takes as input a text caption describing objects present in an image and outputs the parameters of a linear transformation.
arXiv Detail & Related papers (2024-10-03T19:18:13Z) - TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs [5.6168844664788855]
This work presents TanDepth, a practical scale recovery method for obtaining metric depth results from relative estimations at inference-time.<n>Our method leverages sparse measurements from Global Digital Elevation Models (GDEM) by projecting them to the camera view.<n>An adaptation to the Cloth Filter Simulation is presented, which allows selecting ground points from the estimated depth map to then correlate with the projected reference points.
arXiv Detail & Related papers (2024-09-08T15:54:43Z) - ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module.
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z) - DVMNet++: Rethinking Relative Pose Estimation for Unseen Objects [59.51874686414509]
Existing approaches typically predict 3D translation utilizing the ground-truth object bounding box and approximate 3D rotation with a large number of discrete hypotheses.<n>We present a Deep Voxel Matching Network (DVMNet++) that computes the relative object pose in a single pass.<n>Our approach delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-03-20T15:41:32Z) - Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation [31.34615135846137]
We propose a few-shot-based method which learns to adapt the Vision-Language Models for monocular depth estimation.
Specifically, it assigns different depth bins for different scenes, which can be selected by the model during inference.
With only one image per scene for training, our extensive experiment results on the NYU V2 and KITTI dataset demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
arXiv Detail & Related papers (2023-11-02T06:56:50Z) - Monocular Visual-Inertial Depth Estimation [66.71452943981558]
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry.
Our approach performs global scale and shift alignment against sparse metric depth, followed by learning-based dense alignment.
We evaluate on the TartanAir and VOID datasets, observing up to 30% reduction in RMSE with dense scale alignment.
arXiv Detail & Related papers (2023-03-21T18:47:34Z) - Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO)
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z) - DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data [110.29043712400912]
We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation.
Experiments show that our method outperforms previous methods on 8 datasets by a large margin with the zero-shot test setting.
arXiv Detail & Related papers (2020-02-03T05:38:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.