Benchmark on Monocular Metric Depth Estimation in Wildlife Setting
- URL: http://arxiv.org/abs/2510.04723v1
- Date: Mon, 06 Oct 2025 11:43:34 GMT
- Title: Benchmark on Monocular Metric Depth Estimation in Wildlife Setting
- Authors: Niccolò Niccoli, Lorenzo Seidenari, Ilaria Greco, Francesco Rovero,
- Abstract summary: This work introduces the first benchmark for monocular metric depth estimation in wildlife monitoring conditions. We evaluate four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D) alongside a geometric baseline on 93 camera trap images. Our results demonstrate that Depth Anything V2 achieves the best overall performance with a mean absolute error of 0.454m and correlation of 0.962.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Camera traps are widely used for wildlife monitoring, but extracting accurate distance measurements from monocular images remains challenging due to the lack of depth information. While monocular depth estimation (MDE) methods have advanced significantly, their performance in natural wildlife environments has not been systematically evaluated. This work introduces the first benchmark for monocular metric depth estimation in wildlife monitoring conditions. We evaluate four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D) alongside a geometric baseline on 93 camera trap images with ground truth distances obtained using calibrated ChArUco patterns. Our results demonstrate that Depth Anything V2 achieves the best overall performance with a mean absolute error of 0.454m and correlation of 0.962, while methods like ZoeDepth show significant degradation in outdoor natural environments (MAE: 3.087m). We find that median-based depth extraction consistently outperforms mean-based approaches across all deep learning methods. Additionally, we analyze computational efficiency, with ZoeDepth being fastest (0.17s per image) but least accurate, while Depth Anything V2 provides an optimal balance of accuracy and speed (0.22s per image). This benchmark establishes performance baselines for wildlife applications and provides practical guidance for implementing depth estimation in conservation monitoring systems.
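To make the evaluation protocol concrete, the sketch below shows the two depth-extraction strategies compared in the abstract (median vs. mean over a target mask) and the two reported metrics (mean absolute error and correlation). It is a minimal Python illustration assuming a metric depth map and a per-image mask are already available; the function names and the use of NumPy are our own choices, not taken from the paper's code.

```python
import numpy as np

def extract_distance(depth_map: np.ndarray, mask: np.ndarray, reducer: str = "median") -> float:
    """Collapse per-pixel metric depth inside a target mask to one distance in metres.

    depth_map: (H, W) metric depth prediction (e.g. from an MDE model).
    mask:      (H, W) boolean array covering the animal or calibration target.
    reducer:   "median" or "mean", the two strategies compared in the benchmark.
    """
    values = depth_map[mask]
    if values.size == 0:
        raise ValueError("mask selects no pixels")
    return float(np.median(values)) if reducer == "median" else float(values.mean())

def evaluate(pred_m, gt_m) -> dict:
    """Mean absolute error (metres) and Pearson correlation over a set of images."""
    pred_m = np.asarray(pred_m, dtype=float)
    gt_m = np.asarray(gt_m, dtype=float)
    return {
        "mae_m": float(np.mean(np.abs(pred_m - gt_m))),
        "pearson_r": float(np.corrcoef(pred_m, gt_m)[0, 1]),
    }
```

The abstract reports the median reducer as consistently better; being a robust statistic, it is less sensitive than the mean to background pixels that leak into the mask near object boundaries, which is a plausible explanation for that finding.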
Related papers
- Deep in the Jungle: Towards Automating Chimpanzee Population Estimation
Estimation of abundance and density in unmarked populations of great apes relies on statistical frameworks that require animal-to-camera distance measurements. This study introduces and evaluates a sparsely explored alternative: integrating computer vision-based monocular depth estimation (MDE) pipelines directly into ecological camera trap distance measurement for great ape conservation. Using a real-world dataset of 220 camera trap videos documenting a wild chimpanzee population, we combine two MDE models, Dense Prediction Transformers and Depth Anything, with multiple distance sampling strategies. The proposed approach yields population estimates within 22% of those obtained using traditional methods.
arXiv Detail & Related papers (2026-01-30T12:40:47Z)
- OrchardDepth: Precise Metric Depth Estimation of Orchard Scene from Monocular Camera Images
We propose OrchardDepth, which fills the gap in monocular metric depth estimation for orchard/vineyard environments. In addition, we present a new retraining method that improves training by monitoring a consistency regularization between dense depth maps and sparse points.
arXiv Detail & Related papers (2025-02-20T05:40:56Z)
- A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation
We propose a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU. Our experiments highlight improvements over zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches, and better robustness than depth completion approaches. A minimal sketch of this kind of sparse-point rescaling is given after the related-papers list.
arXiv Detail & Related papers (2024-12-18T17:50:15Z)
- Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
Existing methods for depth completion operate in tightly constrained settings. Inspired by advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation. Marigold-DC builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance.
arXiv Detail & Related papers (2024-12-18T00:06:41Z)
- Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. It produces a 2.25-megapixel depth map in 0.3 seconds on a standard GPU.
arXiv Detail & Related papers (2024-10-02T22:42:20Z)
- ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module.
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z)
- Uncertainty Guided Depth Fusion for Spike Camera
We propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse predictions of monocular and stereo depth estimation networks for spike cameras.
Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range.
In order to demonstrate the advantage of spike depth estimation over traditional camera depth estimation, we contribute a spike-depth dataset named CitySpike20K.
arXiv Detail & Related papers (2022-08-26T13:04:01Z)
- Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth
In some video-based scenarios such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift in each frame's prediction may cause depth inconsistency.
We propose a locally weighted linear regression method to recover the scale and shift with very sparse anchor points.
Our method can boost the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks.
arXiv Detail & Related papers (2022-02-03T08:52:54Z)
- Robust Consistent Video Depth Estimation
We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video.
Our algorithm combines two complementary techniques: (1) flexible deformation-splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details.
In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations.
arXiv Detail & Related papers (2020-12-10T18:59:48Z)
- D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry
D3VO is a novel framework for monocular visual odometry that exploits deep networks on three levels -- deep depth, pose and uncertainty estimation.
We first propose a novel self-supervised monocular depth estimation network trained on stereo videos without any external supervision.
We model the photometric uncertainties of pixels on the input images, which improves the depth estimation accuracy.
arXiv Detail & Related papers (2020-03-02T17:47:13Z)
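Several of the entries above (notably the test-time adaptation paper and the locally scale-aligned video depth paper) share one underlying idea: a mis-scaled monocular depth prediction can be aligned to metric units by fitting a scale and shift against a handful of sparse metric points, such as low-resolution LiDAR returns. The sketch below shows the simplest global variant as a linear least-squares fit; it is an illustration of the general technique under our own naming, not code from any of the listed papers.

```python
import numpy as np

def fit_scale_shift(pred: np.ndarray, metric: np.ndarray) -> tuple[float, float]:
    """Least-squares scale s and shift t such that s * pred + t approximates metric.

    pred:   predicted depths sampled at sparse anchor pixels, shape (N,).
    metric: metric depths at the same pixels (e.g. low-resolution LiDAR), shape (N,).
    """
    A = np.stack([pred, np.ones_like(pred)], axis=1)   # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return float(s), float(t)

def rescale(depth_map: np.ndarray, s: float, t: float) -> np.ndarray:
    """Apply the fitted affine correction to the full predicted depth map."""
    return s * depth_map + t
```

A locally weighted variant, as in the video-depth entry, would repeat this fit around each location with anchor points weighted by spatial proximity instead of using a single global (s, t).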