PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion
- URL: http://arxiv.org/abs/2509.26008v1
- Date: Tue, 30 Sep 2025 09:38:59 GMT
- Title: PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion
- Authors: Zhiwei Zhang, Ruikai Xu, Weijian Zhang, Zhizhong Zhang, Xin Tan, Jingyu Gong, Yuan Xie, Lizhuang Ma
- Abstract summary: We present PFDepth, the first pinhole-fisheye framework for heterogeneous multi-view depth estimation. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. We show that PFDepth achieves state-of-the-art performance on the KITTI-360 and RealHet datasets over current mainstream depth networks.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth achieves state-of-the-art performance on the KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.
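The abstract describes a lift-fuse-render pipeline. The sketch below illustrates only the first step, lifting per-view 2D features into one canonical volume, under heavy assumptions: hard voxel scattering stands in for the paper's learnable Gaussian spheres, simple averaging stands in for Heterogeneous Spatial Fusion, and the names (`lift_to_volume`, `rays`) are hypothetical, not the authors' API.

```python
# Minimal sketch, assuming each camera model (pinhole or fisheye) can emit
# per-pixel ray directions in a shared canonical frame. NOT the PFDepth code.
import torch

def lift_to_volume(feat, rays, depth_bins, grid_min, grid_max, grid_shape):
    # feat: (C, H, W) 2D features; rays: (H, W, 3) unit ray directions already
    # in the canonical frame (a fisheye model would supply distorted rays here,
    # which is where camera heterogeneity enters).
    C, H, W = feat.shape
    D = depth_bins.numel()
    shape = torch.tensor(grid_shape)
    pts = rays.unsqueeze(0) * depth_bins.view(D, 1, 1, 1)             # (D, H, W, 3) samples
    idx = torch.floor((pts - grid_min) / (grid_max - grid_min) * shape).long()
    ok = ((idx >= 0) & (idx < shape)).all(-1).reshape(-1)             # in-bounds samples
    flat = ((idx[..., 0] * shape[1] + idx[..., 1]) * shape[2] + idx[..., 2]).reshape(-1)[ok]
    src = feat.unsqueeze(1).expand(C, D, H, W).reshape(C, -1)[:, ok]  # feature per sample
    vol = torch.zeros(C, int(shape.prod()))
    cnt = torch.zeros(int(shape.prod()))
    vol.index_add_(1, flat, src)                                      # scatter features
    cnt.index_add_(0, flat, torch.ones(flat.numel()))                 # count hits per voxel
    return (vol / cnt.clamp(min=1)).reshape(C, *grid_shape)

# Two heterogeneous views land in one canonical volume; overlap is averaged,
# a naive stand-in for the paper's Heterogeneous Spatial Fusion module.
rays = torch.nn.functional.normalize(torch.randn(8, 8, 3), dim=-1)
bins = torch.linspace(0.5, 4.0, 16)
vol_a = lift_to_volume(torch.randn(4, 8, 8), rays, bins, -5.0, 5.0, (16, 16, 8))
vol_b = lift_to_volume(torch.randn(4, 8, 8), rays, bins, -5.0, 5.0, (16, 16, 8))
fused = (vol_a + vol_b) / 2
```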
Related papers
- UDPNet: Unleashing Depth-based Priors for Robust Image Dehazing [77.10640210751981]
UDPNet is a general framework that leverages depth-based priors from DepthAnything V2, a large-scale pretrained depth estimation model. Our proposed solution establishes a new benchmark for depth-aware dehazing across various scenarios.
arXiv Detail & Related papers (2026-01-11T13:29:02Z)
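To make the depth-prior idea concrete, here is a hedged sketch (my illustration, not UDPNet's network): a monocular depth map from a pretrained model such as DepthAnything V2 yields the transmission of the standard atmospheric scattering model I = J*t + A*(1 - t), with t = exp(-beta * d), which can then be inverted.

```python
# Illustrative only: classic depth-driven dehazing, not UDPNet's learned pipeline.
import numpy as np

def dehaze_with_depth(hazy, depth, airlight=0.9, beta=1.0, t_min=0.1):
    # hazy: (H, W, 3) image in [0, 1]; depth: (H, W) from a pretrained depth model
    t = np.exp(-beta * depth)[..., None]          # transmission from the depth prior
    t = np.clip(t, t_min, 1.0)                    # avoid amplifying noise at far range
    clear = (hazy - airlight) / t + airlight      # invert the scattering model
    return np.clip(clear, 0.0, 1.0)
```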
- Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision [12.972772139292957]
This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method.
arXiv Detail & Related papers (2025-11-13T13:51:16Z)
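The usual physical model behind depth-of-field supervision is the thin-lens circle of confusion, which maps scene depth to blur size; the snippet below shows that standard formula (the paper's exact formulation may differ).

```python
# Standard thin-lens circle-of-confusion (CoC); illustrative physics only.
def coc_diameter(depth, focus_dist, focal_len, f_number):
    # All distances in meters; returns the CoC diameter on the sensor in meters.
    aperture = focal_len / f_number
    return aperture * abs(depth - focus_dist) / depth * focal_len / (focus_dist - focal_len)

# A point 1 m behind a 2 m focus plane, seen through a 50 mm f/2 lens:
print(coc_diameter(3.0, 2.0, 0.05, 2.0))  # ~0.00021 m, i.e. a ~0.21 mm blur circle
```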
- FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers [91.59069344768858]
We introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for the 3D detection transformer decoder. FreqPDE combines the 2D image features and 3D position embeddings to generate 3D depth-aware features for query decoding.
arXiv Detail & Related papers (2025-10-17T07:36:54Z)
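For intuition, here is a simplified positional depth embedding in the spirit of PETR-style 3D position embeddings (FreqPDE's frequency-aware design differs; all names here are hypothetical): back-project each pixel at its predicted depth, encode the 3D point sinusoidally, and add it to the image features.

```python
# Sketch of a depth-aware positional embedding; NOT FreqPDE's actual module.
import torch

def sinusoidal(x, num_freqs=8):
    # x: (..., 3) 3D points -> (..., 3 * 2 * num_freqs) embedding
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * torch.pi
    ang = x.unsqueeze(-1) * freqs                          # (..., 3, F)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

def depth_position_embedding(depth, K_inv):
    # depth: (H, W) per-pixel depth; K_inv: (3, 3) inverse intrinsics (pinhole assumed)
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)  # homogeneous pixel coords
    pts = (pix @ K_inv.T) * depth.unsqueeze(-1)            # back-projected 3D points
    return sinusoidal(pts)                                 # (H, W, 48) with defaults

pe = depth_position_embedding(torch.rand(32, 32) * 10, torch.eye(3))
feat_3d_aware = torch.randn(32, 32, 48) + pe               # add to 2D image features
```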
- DepthFusion: Depth-Aware Hybrid Feature Fusion for LiDAR-Camera 3D Object Detection [32.07206206508925]
State-of-the-art LiDAR-camera 3D object detectors usually focus on feature fusion. Through statistical analysis and visualization, we are the first to observe that different modalities play different roles as depth varies. We propose a Depth-Aware Hybrid Feature Fusion strategy that guides the weights of the point cloud and RGB image modalities.
arXiv Detail & Related papers (2025-05-12T09:53:00Z)
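A toy version of depth-guided modality weighting (illustrative; not DepthFusion's actual module): a small MLP maps per-location depth to a blending weight between LiDAR and camera features.

```python
# Hypothetical depth-aware blending gate; a sketch, not the paper's design.
import torch
import torch.nn as nn

class DepthAwareBlend(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, lidar_feat, cam_feat, depth):
        # lidar_feat, cam_feat: (N, C) aligned features; depth: (N, 1) per location
        w = torch.sigmoid(self.gate(depth))           # depth-dependent weight in (0, 1)
        return w * lidar_feat + (1 - w) * cam_feat    # LiDAR trusted where w is high

fuse = DepthAwareBlend()
out = fuse(torch.randn(10, 64), torch.randn(10, 64), torch.rand(10, 1) * 50)
```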
- Pillar-Voxel Fusion Network for 3D Object Detection in Airborne Hyperspectral Point Clouds [35.24778377226701]
We propose PiV-AHPC, a 3D object detection network for airborne HPCs. We first develop a pillar-voxel dual-branch encoder, where the former captures spectral and vertical structural features from HPCs to overcome spectral distortion. A multi-level feature fusion mechanism is devised to enhance information interaction between the two branches.
arXiv Detail & Related papers (2025-04-13T10:13:48Z)
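The pillar-vs-voxel distinction behind such a dual-branch encoder is simple to show (illustrative; not the paper's network): pillars bin points over (x, y) only, while voxels also bin along z.

```python
# Minimal pillarization vs. voxelization of a point cloud; a sketch only.
import numpy as np

def scatter_mean(points, cell, grid):
    # points: (N, D) with columns x, y, z, spectral...; bins the first len(grid)
    # columns and returns the per-cell mean of all D attributes.
    idx = np.clip(np.floor(points[:, :len(grid)] / cell).astype(int), 0, np.array(grid) - 1)
    flat = np.ravel_multi_index(idx.T, grid)
    sums = np.zeros((np.prod(grid), points.shape[1])); cnt = np.zeros(np.prod(grid))
    np.add.at(sums, flat, points)                        # accumulate point attributes
    np.add.at(cnt, flat, 1)                              # count points per cell
    return sums / np.maximum(cnt, 1)[:, None]

pts = np.random.rand(1000, 5) * [40, 40, 8, 1, 1]        # x, y, z plus two spectral bands
pillars = scatter_mean(pts, 2.0, (20, 20))               # 2D bins: a pillar spans full height
voxels = scatter_mean(pts, 2.0, (20, 20, 4))             # 3D bins: height is discretized too
```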
- MCPDepth: Omnidirectional Depth Estimation via Stereo Matching from Multi-Cylindrical Panoramas [49.891712558113845]
We introduce Multi-Cylindrical Panoramic Depth Estimation (MCPDepth), a two-stage framework designed to enhance omnidirectional depth estimation. Our method improves the mean absolute error (MAE) by 18.8% on the outdoor dataset Deep360 and by 19.9% on the real dataset 3D60.
arXiv Detail & Related papers (2024-08-03T03:35:37Z)
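The projection choice is the interesting part: unlike equirectangular panoramas, cylindrical ones keep vertical structures straight, which simplifies stereo matching. A minimal remap between the two projections, assuming an equirectangular input (the paper's full stereo pipeline is not shown):

```python
# Equirectangular -> cylindrical remap; illustrative of the projection only.
import numpy as np

def equirect_to_cylindrical(pano, out_h, max_lat=np.pi / 3):
    H, W, _ = pano.shape
    lon = np.linspace(-np.pi, np.pi, W)                     # one column per longitude
    y = np.linspace(-np.tan(max_lat), np.tan(max_lat), out_h)
    lat = np.arctan(y)                                      # cylinder height -> latitude
    u = ((lon + np.pi) / (2 * np.pi) * (W - 1)).astype(int)
    v = ((lat + np.pi / 2) / np.pi * (H - 1)).astype(int)
    return pano[np.ix_(v, u)]                               # nearest-neighbor remap

cyl = equirect_to_cylindrical(np.zeros((256, 512, 3)), out_h=200)
print(cyl.shape)  # (200, 512, 3)
```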
- DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
- VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction [71.83308989022635]
In this paper, we advocate that replicating the traditional two-stage framework with deep neural networks improves both the interpretability and the accuracy of the results.
Our network operates in two steps: 1) local computation of depth maps with a deep MVS technique, and 2) fusion of the depth maps and image features to build a single TSDF volume.
In order to improve the matching performance between images acquired from very different viewpoints, we introduce a rotation-invariant 3D convolution kernel called PosedConv.
arXiv Detail & Related papers (2021-08-19T11:33:58Z)
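The second stage generalizes classic TSDF fusion of depth maps into one volume; the sketch below is that textbook baseline (VolumeFusion itself fuses learned features, so treat this as a simplified stand-in).

```python
# Classic running-average TSDF fusion; illustrative, not the paper's network.
import numpy as np

def fuse_tsdf(depth_maps, Ks, poses, grid=(32, 32, 32), origin=-2.0, voxel=0.125, trunc=0.3):
    tsdf = np.ones(np.prod(grid)); weight = np.zeros(np.prod(grid))
    ii = np.stack(np.meshgrid(*[np.arange(n) for n in grid], indexing="ij"), -1).reshape(-1, 3)
    pts = ii * voxel + origin                              # voxel centers in world frame
    for depth, K, T in zip(depth_maps, Ks, poses):         # T: 4x4 world-to-camera pose
        cam = pts @ T[:3, :3].T + T[:3, 3]
        z = cam[:, 2]
        ok = z > 1e-6                                      # keep voxels in front of camera
        u = np.zeros(len(z), int); v = np.zeros(len(z), int)
        u[ok] = np.round(cam[ok, 0] / z[ok] * K[0, 0] + K[0, 2]).astype(int)
        v[ok] = np.round(cam[ok, 1] / z[ok] * K[1, 1] + K[1, 2]).astype(int)
        H, W = depth.shape
        ok &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
        sdf = depth[v[ok], u[ok]] - z[ok]                  # + in front of surface, - behind
        keep = sdf > -trunc                                # drop voxels far behind surface
        idx = np.flatnonzero(ok)[keep]
        d = np.clip(sdf[keep] / trunc, -1.0, 1.0)
        tsdf[idx] = (tsdf[idx] * weight[idx] + d) / (weight[idx] + 1)  # running average
        weight[idx] += 1
    return tsdf.reshape(grid), weight.reshape(grid)

# One synthetic view of a flat wall 1.5 m in front of the camera.
K = np.array([[60.0, 0, 32], [0, 60.0, 32], [0, 0, 1]])
vol, w = fuse_tsdf([np.full((64, 64), 1.5)], [K], [np.eye(4)])
```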
- A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection [89.88222217065858]
We design a single stream network to use the depth map to guide early fusion and middle fusion between RGB and depth.
This model is 55.5% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image.
arXiv Detail & Related papers (2020-07-14T04:40:14Z)
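A toy version of depth-guided early fusion (my illustration; the paper's single-stream design is more involved): the depth map is turned into a spatial gate that modulates the RGB features inside one stream.

```python
# Hypothetical depth-guided fusion block; a sketch, not the paper's model.
import torch
import torch.nn as nn

class DepthGuidedFusion(nn.Module):
    def __init__(self, in_ch=3, feat=16):
        super().__init__()
        self.rgb_conv = nn.Conv2d(in_ch, feat, 3, padding=1)
        self.depth_gate = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.Sigmoid())

    def forward(self, rgb, depth):
        # rgb: (B, 3, H, W); depth: (B, 1, H, W), processed in a single stream
        attn = self.depth_gate(depth)              # per-pixel gate derived from depth
        return self.rgb_conv(rgb) * attn           # depth guides the early fusion

m = DepthGuidedFusion()
y = m(torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
```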
- DeepRelativeFusion: Dense Monocular SLAM using Single-Image Relative Depth Prediction [4.9188958016378495]
We propose a dense monocular SLAM system, named DeepRelativeFusion, that is capable of recovering a globally consistent 3D structure.
We use visual SLAM to reliably recover the camera poses and semi-dense depth maps, and then use relative depth prediction to densify the semi-dense depth maps and refine the pose graph.
Our system outperforms state-of-the-art dense SLAM systems quantitatively in dense reconstruction accuracy by a large margin.
arXiv Detail & Related papers (2020-06-07T05:22:29Z)
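The densification step above hinges on aligning an up-to-scale depth prediction to the metric SLAM depths; a standard way to do this is a least-squares scale/shift fit (my illustration, not the authors' code).

```python
# Hedged sketch: align relative depth to semi-dense SLAM depth, then densify.
import numpy as np

def align_relative_depth(rel_depth, slam_depth, mask):
    # rel_depth: (H, W) up-to-scale prediction; slam_depth: (H, W) semi-dense
    # metric depths, valid where mask is True. Solve min ||a*rel + b - slam||^2.
    x = rel_depth[mask]; y = slam_depth[mask]
    A = np.stack([x, np.ones_like(x)], 1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * rel_depth + b                      # dense, metrically aligned depth map

H = W = 4
rel = np.random.rand(H, W)
mask = np.random.rand(H, W) > 0.5                 # semi-dense coverage from SLAM
dense = align_relative_depth(rel, 2.0 * rel + 0.3, mask)  # recovers a~2, b~0.3
```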