Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
- URL: http://arxiv.org/abs/2509.15224v1
- Date: Thu, 18 Sep 2025 17:59:51 GMT
- Title: Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
- Authors: Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, Stefano Mattoccia
- Abstract summary: Event cameras capture sparse, high-temporal-resolution visual information. The lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. We propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup even available off-the-shelf, and exploits the robustness of large-scale VFMs.
- Score: 47.90167568304715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm that leverages a Vision Foundation Model (VFM) to generate dense proxy labels. Our strategy requires only an event stream spatially aligned with RGB frames, a simple setup available even off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs to infer depth from monocular event cameras, either using a vanilla model such as Depth Anything v2 (DAv2) or deriving a novel recurrent architecture from it. We evaluate our approach on synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
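To make the cross-modal distillation recipe described in the abstract concrete, here is a minimal, hypothetical PyTorch sketch: a frozen VFM (standing in for Depth Anything v2) produces dense proxy depth from the spatially aligned RGB frame, and an event-only student network is trained against those proxies. The `EventDepthNet` layout, the voxel-grid encoding of events, and the scale-invariant log loss are illustrative assumptions, not the authors' implementation.

```python
# Sketch of cross-modal distillation: proxy depth labels from a frozen VFM on
# RGB frames supervise an event-only depth network. NOT the paper's code; the
# VFM is a placeholder callable and all hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventDepthNet(nn.Module):
    """Toy student network mapping an event voxel grid to a dense depth map."""

    def __init__(self, in_bins: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_bins, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(voxels))


def scale_invariant_log_loss(pred: torch.Tensor, proxy: torch.Tensor) -> torch.Tensor:
    """Scale-invariant log loss (Eigen et al.), a common choice for depth distillation."""
    diff = torch.log(pred.clamp(min=1e-3)) - torch.log(proxy.clamp(min=1e-3))
    return (diff ** 2).mean() - 0.5 * diff.mean() ** 2


def distillation_step(student, frozen_vfm, rgb, voxels, optimizer):
    """One training step: proxy labels from aligned RGB supervise the event model."""
    with torch.no_grad():
        proxy_depth = frozen_vfm(rgb)          # dense proxy labels, shape (B, 1, H, W)
    pred_depth = F.softplus(student(voxels))   # event-only prediction, kept positive
    loss = scale_invariant_log_loss(pred_depth, proxy_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    student = EventDepthNet(in_bins=5)
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    # Stand-in for a frozen VFM such as DAv2 run on the spatially aligned RGB frame.
    frozen_vfm = lambda rgb: torch.rand(rgb.shape[0], 1, *rgb.shape[-2:]) * 10 + 0.1
    rgb = torch.rand(2, 3, 64, 64)     # aligned RGB frames
    voxels = torch.rand(2, 5, 64, 64)  # event voxel grids (5 temporal bins)
    print(distillation_step(student, frozen_vfm, rgb, voxels, opt))
```

In the paper's setting the student would itself be an adapted VFM (vanilla DAv2 or a recurrent variant derived from it); the toy `EventDepthNet` above only marks where that model plugs into the training loop.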
Related papers
- EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition [54.55914886780534]
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. We introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios.
arXiv Detail & Related papers (2026-02-13T13:25:05Z) - UDPNet: Unleashing Depth-based Priors for Robust Image Dehazing [77.10640210751981]
UDPNet is a general framework that leverages depth-based priors from the large-scale pretrained depth estimation model DepthAnything V2. Our proposed solution establishes a new benchmark for depth-aware dehazing across various scenarios.
arXiv Detail & Related papers (2026-01-11T13:29:02Z) - UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block [6.994911870644179]
We propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.
arXiv Detail & Related papers (2025-07-26T13:29:48Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose Self-supervised Transfer (PST) and a Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - RGB-Thermal Infrared Fusion for Robust Depth Estimation in Complex Environments [0.0]
This paper proposes a novel multimodal depth estimation model, RTFusion, which enhances depth estimation accuracy and robustness. The model incorporates a unique fusion mechanism, EGFusion, consisting of the Mutual Complementary Attention (MCA) module for cross-modal feature alignment. Experiments on the MS2 and ViViD++ datasets demonstrate that the proposed model consistently produces high-quality depth maps.
arXiv Detail & Related papers (2025-03-05T01:35:14Z) - Self-supervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion [16.673178271652553]
Self-supervised monocular depth estimation has received widespread attention because of its capability to train without ground truth. We employ a generative diffusion model with a unique denoising training process for self-supervised monocular depth estimation. We conduct experiments on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2024-06-14T07:31:20Z) - PCDepth: Pattern-based Complementary Learning for Monocular Depth Estimation by Best of Both Worlds [15.823230141827358]
Event cameras record scene dynamics with high temporal resolution, providing rich scene details for monocular depth estimation.
Existing complementary learning approaches for MDE fuse intensity information from images and scene details from event data for better scene understanding.
We propose a Pattern-based Complementary learning architecture for monocular Depth estimation (PCDepth).
arXiv Detail & Related papers (2024-02-29T07:31:59Z) - Self-supervised Event-based Monocular Depth Estimation using Cross-modal Consistency [18.288912105820167]
We propose a self-supervised event-based monocular depth estimation framework named EMoDepth.
EMoDepth constrains the training process using the cross-modal consistency from intensity frames that are aligned with events in the pixel coordinate.
In inference, only events are used for monocular depth prediction.
arXiv Detail & Related papers (2024-01-14T07:16:52Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z) - Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras produce brightness changes in the form of a stream of asynchronous events instead of intensity frames.
Recent learning-based approaches have been applied to event data for tasks such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z)