Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning with Vision Foundation Models
- URL: http://arxiv.org/abs/2507.02148v2
- Date: Thu, 10 Jul 2025 14:55:57 GMT
- Title: Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning with Vision Foundation Models
- Authors: Zijie Cai, Christopher Metzler
- Abstract summary: We present a benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets. Our results show that large-scale models trained on terrestrial data (real or synthetic) are effective in in-air settings, but perform poorly underwater. This study presents a detailed evaluation and visualization of monocular metric depth estimation in underwater scenes.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular depth estimation has recently progressed beyond ordinal depth to provide metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, including FLSea and SQUID. We evaluated a diverse set of state-of-the-art Vision Foundation Models across a range of underwater conditions and depth ranges. Our results show that large-scale models trained on terrestrial data (real or synthetic) are effective in in-air settings, but perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S backbone encoder on a synthetic underwater variant of the Hypersim dataset, which we simulated using a physically based underwater image formation model. Our fine-tuned model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. This study presents a detailed evaluation and visualization of monocular metric depth estimation in underwater scenes, emphasizing the importance of domain adaptation and scale-aware supervision for achieving robust and generalizable metric depth predictions using foundation models in challenging environments.
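For context, physically based underwater image formation models of the kind referenced above typically combine depth-dependent attenuation of the direct signal with depth-dependent backscatter toward a veiling-light color. The sketch below illustrates that general formulation applied to an RGB-D frame (e.g., a Hypersim image and its metric depth map); the exact model variant, water types, and attenuation coefficients used by the authors are not stated in the abstract, so every parameter value here is an illustrative assumption.

```python
import numpy as np

def simulate_underwater(J, z, beta_d, beta_b, B_inf):
    """Apply a simplified physically based underwater image formation model.

    J      : (H, W, 3) clean in-air RGB image in [0, 1]
    z      : (H, W)    metric depth map in meters
    beta_d : (3,)      per-channel direct-attenuation coefficients (1/m)
    beta_b : (3,)      per-channel backscatter coefficients (1/m)
    B_inf  : (3,)      veiling-light (background water) color in [0, 1]
    """
    z = z[..., None]                              # broadcast depth over channels
    direct = J * np.exp(-beta_d * z)              # attenuated direct signal
    backscatter = B_inf * (1.0 - np.exp(-beta_b * z))
    return np.clip(direct + backscatter, 0.0, 1.0)

# Illustrative coefficients only (not from the paper): red attenuates fastest,
# giving the characteristic blue-green cast.
beta_d = np.array([0.45, 0.12, 0.08])
beta_b = np.array([0.35, 0.10, 0.07])
B_inf  = np.array([0.05, 0.35, 0.45])

J = np.random.rand(480, 640, 3)                   # stand-in for a Hypersim RGB frame
z = np.random.uniform(0.5, 10.0, (480, 640))      # stand-in for its metric depth (m)
I_uw = simulate_underwater(J, z, beta_d, beta_b, B_inf)
```

Per-channel coefficients with red attenuating fastest reproduce the color cast and loss of contrast at distance that drive the domain shift discussed in the abstract.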
Related papers
- Tree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation [85.17735565146106]
Underwater Monocular Depth Estimation (UMDE) is a critical task that aims to estimate high-precision depth maps from underwater degraded images. We develop a novel tree-aware Mamba method, dubbed Tree-Mamba, for estimating accurate monocular depth maps from underwater degraded images. We construct an underwater depth estimation benchmark (called BlueDepth), which consists of 38,162 underwater image pairs with reliable depth labels.
arXiv Detail & Related papers (2025-07-10T12:10:51Z) - AQUA20: A Benchmark Dataset for Underwater Species Classification under Challenging Conditions [1.2289361708127877]
This paper introduces AQUA20, a comprehensive benchmark dataset comprising 8,171 underwater images across 20 marine species. Thirteen state-of-the-art deep learning models were evaluated to benchmark their performance in classifying marine species under challenging conditions. Results show ConvNeXt achieving the best performance, with a Top-3 accuracy of 98.82% and a Top-1 accuracy of 90.69%, as well as the highest overall F1-score of 88.92% with a moderately large parameter size.
arXiv Detail & Related papers (2025-06-20T19:54:35Z) - Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation [31.47797579690604]
We present Plenodium, a 3D representation framework capable of jointly modeling both objects and participating media. In contrast to existing medium representations that rely solely on view-dependent modeling, our novel plenoptic medium representation incorporates both directional and positional information. Experiments on real-world underwater datasets demonstrate that our method achieves significant improvements in 3D reconstruction.
arXiv Detail & Related papers (2025-05-27T14:37:58Z) - UWSAM: Segment Anything Model Guided Underwater Instance Segmentation and A Large-scale Benchmark Dataset [62.00529957144851]
We propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. We then introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. We show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets.
arXiv Detail & Related papers (2025-05-21T14:36:01Z) - Dense Geometry Supervision for Underwater Depth Estimation [0.0]
This paper proposes a novel approach to address the existing challenges in monocular depth estimation methods for underwater environments. We construct an economically efficient dataset suitable for underwater scenarios by employing multi-view depth estimation. We introduce a texture-depth fusion module, which aims to effectively exploit and integrate depth information from texture cues.
arXiv Detail & Related papers (2025-04-25T10:27:25Z) - Distilling Monocular Foundation Model for Fine-grained Depth Completion [17.603217168518356]
We propose a two-stage knowledge distillation framework to provide dense supervision for depth completion. In the first stage, we generate diverse training data from natural images, which distills geometric knowledge to depth completion. In the second stage, we employ a scale- and shift-invariant loss to learn real-world scales when fine-tuning on real-world datasets.
arXiv Detail & Related papers (2025-03-21T09:34:01Z) - FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation [65.01601309903971]
We introduce FAFA, a Frequency-Aware Flow-Aided self-supervised framework for 6D pose estimation of unmanned underwater vehicles (UUVs).
Our framework relies solely on the 3D model and RGB images, alleviating the need for any real pose annotations or other-modality data like depths.
We evaluate the effectiveness of FAFA on common underwater object pose benchmarks and showcase significant performance improvements compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-09-25T03:54:01Z) - UMono: Physical Model Informed Hybrid CNN-Transformer Framework for Underwater Monocular Depth Estimation [5.596432047035205]
Underwater monocular depth estimation serves as the foundation for tasks such as 3D reconstruction of underwater scenes.
Existing methods fail to consider the unique characteristics of underwater environments.
In this paper, an end-to-end learning framework for underwater monocular depth estimation called UMono is presented.
arXiv Detail & Related papers (2024-07-25T07:52:11Z) - ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module.
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z) - A Physical Model-Guided Framework for Underwater Image Enhancement and Depth Estimation [20.349103580702028]
Existing underwater image enhancement approaches fail to accurately estimate imaging model parameters such as depth and veiling light. We propose a model-guided framework for jointly training a Deep Degradation Model with any advanced UIE model. Our framework achieves remarkable enhancement results across diverse underwater scenes.
arXiv Detail & Related papers (2024-07-05T03:10:13Z) - Diving into Underwater: Segment Anything Model Guided Underwater Salient Instance Segmentation and A Large-scale Dataset [60.14089302022989]
Underwater vision tasks often suffer from low segmentation accuracy due to the complex underwater circumstances.
We construct the first large-scale underwater salient instance segmentation dataset (USIS10K).
We propose an Underwater Salient Instance architecture based on Segment Anything Model (USIS-SAM) specifically for the underwater domain.
arXiv Detail & Related papers (2024-06-10T06:17:33Z) - Depth-aware Volume Attention for Texture-less Stereo Matching [67.46404479356896]
We propose a lightweight volume refinement scheme to tackle the texture deterioration in practical outdoor scenarios.
We introduce a depth volume supervised by the ground-truth depth map, capturing the relative hierarchy of image texture.
Local fine structure and context are emphasized to mitigate ambiguity and redundancy during volume aggregation.
arXiv Detail & Related papers (2024-02-14T04:07:44Z) - Improving Underwater Visual Tracking With a Large Scale Dataset and Image Enhancement [70.2429155741593]
This paper presents a new dataset and general tracker enhancement method for Underwater Visual Object Tracking (UVOT).
It poses distinct challenges; the underwater environment exhibits non-uniform lighting conditions, low visibility, lack of sharpness, low contrast, camouflage, and reflections from suspended particles.
We propose a novel underwater image enhancement algorithm designed specifically to boost tracking quality.
The method yields a significant performance improvement, of up to 5.0% AUC, for state-of-the-art (SOTA) visual trackers.
arXiv Detail & Related papers (2023-08-30T07:41:26Z) - Monocular Visual-Inertial Depth Estimation [66.71452943981558]
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry.
Our approach performs global scale and shift alignment against sparse metric depth, followed by learning-based dense alignment (see the generic scale-and-shift sketch after this list).
We evaluate on the TartanAir and VOID datasets, observing up to 30% reduction in RMSE with dense scale alignment.
arXiv Detail & Related papers (2023-03-21T18:47:34Z) - An evaluation of deep learning models for predicting water depth evolution in urban floods [59.31940764426359]
We compare different deep learning models for prediction of water depth at high spatial resolution.
Deep learning models are trained to reproduce the data simulated by the CADDIES cellular-automata flood model.
Our results show that the deep learning models generally yield lower errors than the other methods.
arXiv Detail & Related papers (2023-02-20T16:08:54Z)
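As a generic illustration of the scale-and-shift alignment mentioned in the "Monocular Visual-Inertial Depth Estimation" entry above (and common to many zero-shot depth evaluation protocols), the sketch below fits a global scale and shift to a predicted depth map against sparse metric depth by least squares. It is an assumption-laden sketch, not the implementation from any of the listed papers; the function and variable names are hypothetical.

```python
import numpy as np

def align_scale_shift(pred, sparse_gt, valid):
    """Least-squares scale-and-shift alignment of a predicted depth map
    against sparse metric depth (generic illustration only).

    pred      : (H, W) predicted depth (arbitrary scale)
    sparse_gt : (H, W) sparse metric depth, arbitrary values where unobserved
    valid     : (H, W) boolean mask of observed metric points
    """
    x = pred[valid]
    y = sparse_gt[valid]
    A = np.stack([x, np.ones_like(x)], axis=1)      # columns [prediction, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)  # solve min ||s*x + t - y||^2
    return s * pred + t                             # metrically aligned depth map
```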
This list is automatically generated from the titles and abstracts of the papers in this site.