Related papers: TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast

TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast

URL: http://arxiv.org/abs/2506.13387v1
Date: Mon, 16 Jun 2025 11:50:00 GMT
Title: TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast
Authors: Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren,
Abstract summary: Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE)<n>MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downstream applications.<n>Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level.
Score: 7.127920563966129
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downstream applications. To this end, we aim to build up a framework to solve scale uncertainty and transfer relative depth to metric depth. Previous methods used language as input and estimated two factors for conducting rescaling. Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level. Features from two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning to utilize depth distribution as guidance to enforce the model learning about intrinsic knowledge aligning with the scale distribution. TR2M only exploits a small number of trainable parameters to train on datasets in various domains and experiments not only demonstrate TR2M's great performance in seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the huge potential in pixel-wise transferring relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)

Related papers

Depth Anything with Any Prior [64.39991799606146]
Prior Depth Anything is a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction.<n>We develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors.<n>Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets.
arXiv Detail & Related papers (2025-05-15T17:59:50Z)
VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery [2.8834278113855896]
VGLD stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the description.<n>This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map.<n>Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios.
arXiv Detail & Related papers (2025-05-05T14:57:16Z)
Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image [51.689871870692194]
Metric-r is a novel sliding anchor-based metric depth estimation method.<n>Our design enables a unified and adaptive depth representation across diverse environments.
arXiv Detail & Related papers (2025-04-16T14:12:25Z)
RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions [47.614203035800735]
Inferring depth from a single image is an ill-posed problem due to the loss of scale from perspective projection. Our goal is to recover metric-scaled depth maps through a linear transformation. Our method, RSA, takes as input a text caption describing objects present in an image and outputs the parameters of a linear transformation.
arXiv Detail & Related papers (2024-10-03T19:18:13Z)
Self-supervised Monocular Depth Estimation with Large Kernel Attention [30.44895226042849]
We propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies. Our method achieves competitive results on the KITTI dataset.
arXiv Detail & Related papers (2024-09-26T14:44:41Z)
ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth. Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module. Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z)
Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation [34.786268652516355]
Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data. We propose a depth-aware framework to explicitly leverage depth estimation to mix the categories and facilitate the two complementary tasks, i.e., segmentation and depth learning. In particular, the framework contains a Depth-guided Contextual Filter (DCF) forndata augmentation and a cross-task encoder for contextual learning.
arXiv Detail & Related papers (2023-11-21T15:39:21Z)
Monocular Visual-Inertial Depth Estimation [66.71452943981558]
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry. Our approach performs global scale and shift alignment against sparse metric depth, followed by learning-based dense alignment. We evaluate on the TartanAir and VOID datasets, observing up to 30% reduction in RMSE with dense scale alignment.
arXiv Detail & Related papers (2023-03-21T18:47:34Z)
Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO) The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes. Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.