PatchFusion: An End-to-End Tile-Based Framework for High-Resolution
Monocular Metric Depth Estimation
- URL: http://arxiv.org/abs/2312.02284v1
- Date: Mon, 4 Dec 2023 19:03:12 GMT
- Title: PatchFusion: An End-to-End Tile-Based Framework for High-Resolution
Monocular Metric Depth Estimation
- Authors: Zhenyu Li, Shariq Farooq Bhat, Peter Wonka
- Abstract summary: Single image depth estimation is a foundational task in computer vision and generative modeling.
We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art.
Experiments on UnrealStereo4K, MVS-Synth, and Middlebury 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details.
- Score: 47.53810786827547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single image depth estimation is a foundational task in computer vision and
generative modeling. However, prevailing depth estimation models grapple with
accommodating the increasing resolutions commonplace in today's consumer
cameras and devices. Existing high-resolution strategies show promise, but they
often face limitations, ranging from error propagation to the loss of
high-frequency details. We present PatchFusion, a novel tile-based framework
with three key components to improve the current state of the art: (1) A
patch-wise fusion network that fuses a globally-consistent coarse prediction
with finer, inconsistent tiled predictions via high-level feature guidance, (2)
A Global-to-Local (G2L) module that adds vital context to the fusion network,
discarding the need for patch selection heuristics, and (3) A Consistency-Aware
Training (CAT) and Inference (CAI) approach, emphasizing patch overlap
consistency and thereby eradicating the necessity for post-processing.
Experiments on UnrealStereo4K, MVS-Synth, and Middlebury 2014 demonstrate that
our framework can generate high-resolution depth maps with intricate details.
PatchFusion is independent of the base model for depth estimation. Notably, our
framework built on top of the SOTA ZoeDepth improves the root mean squared
error (RMSE) by 17.3% on UnrealStereo4K and 29.4% on MVS-Synth.
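The tile-based idea in the abstract, a globally consistent coarse prediction refined by overlapping patch predictions whose overlap consistency removes the need for post-processing, can be illustrated with a simple overlap-averaging loop. This is a hypothetical sketch, not the paper's method: `fine_fn` stands in for the patch-wise fusion network, and plain averaging stands in for the learned Consistency-Aware Inference.

```python
import numpy as np

def fuse_tiles(coarse, fine_fn, patch=384, stride=192):
    """Blend overlapping fine predictions over a coarse depth map.

    coarse  : (H, W) globally consistent low-detail depth prediction
    fine_fn : callable mapping a (patch, patch) crop to a refined patch
    Overlapping tiles are averaged per pixel; pixels no tile covers
    (when (H - patch) is not a multiple of stride) keep the coarse value.
    """
    H, W = coarse.shape
    acc = np.zeros((H, W), dtype=np.float64)   # sum of fine predictions
    cnt = np.zeros((H, W), dtype=np.float64)   # how many tiles hit each pixel
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            acc[y:y + patch, x:x + patch] += fine_fn(coarse[y:y + patch, x:x + patch])
            cnt[y:y + patch, x:x + patch] += 1.0
    out = coarse.astype(np.float64).copy()
    covered = cnt > 0
    out[covered] = acc[covered] / cnt[covered]
    return out
```

With `stride < patch`, every interior pixel receives several overlapping predictions, which is what makes seam artifacts at tile borders average out.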
Related papers
- DEFOM-Stereo: Depth Foundation Model Based Stereo Matching [12.22373236061929]
DEFOM-Stereo is built to facilitate robust stereo matching with monocular depth cues.
DEFOM-Stereo is verified to have comparable performance on the Scene Flow dataset with state-of-the-art (SOTA) methods.
At the same time, our model outperforms previous models on the individual benchmarks.
arXiv Detail & Related papers (2025-01-16T10:59:29Z)
- Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images.
Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities.
We propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z)
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset while requiring the lowest image resolution and the most lightweight image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
- C2F2NeUS: Cascade Cost Frustum Fusion for High Fidelity and Generalizable Neural Surface Reconstruction [12.621233209149953]
We introduce a novel integration scheme that combines the multi-view stereo with neural signed distance function representations.
Our method reconstructs robust surfaces and outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2023-06-16T17:56:16Z)
- Searching a Compact Architecture for Robust Multi-Exposure Image Fusion [55.37210629454589]
Two major stumbling blocks hinder development: pixel misalignment and inefficient inference.
This study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion.
The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios.
arXiv Detail & Related papers (2023-05-20T17:01:52Z)
- GARNet: Global-Aware Multi-View 3D Reconstruction Network and the Cost-Performance Tradeoff [10.8606881536924]
We propose a global-aware attention-based fusion approach that builds the correlation between each branch and the global to provide a comprehensive foundation for weights inference.
In order to enhance the ability of the network, we introduce a novel loss function to supervise the shape overall.
Experiments on ShapeNet verify that our method outperforms existing SOTA methods.
arXiv Detail & Related papers (2022-11-04T07:45:19Z)
- On Robust Cross-View Consistency in Self-Supervised Monocular Depth Estimation [56.97699793236174]
We study two kinds of robust cross-view consistency in this paper.
We exploit the temporal coherence in both depth feature space and 3D voxel space for self-supervised monocular depth estimation.
Experimental results on several outdoor benchmarks show that our method outperforms current state-of-the-art techniques.
arXiv Detail & Related papers (2022-09-19T03:46:13Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation [14.81943833870932]
We present an improved DepthNet, HR-Depth, with two effective strategies.
Using ResNet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the fewest parameters at both high and low resolution.
arXiv Detail & Related papers (2020-12-14T09:15:15Z)
- Fusion of Range and Stereo Data for High-Resolution Scene-Modeling [20.824550995195057]
This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps.
We combine low-resolution depth data with high-resolution stereo data, in a maximum a posteriori (MAP) formulation.
The accuracy of the method is not compromised, owing to three properties of the data-term in the energy function.
arXiv Detail & Related papers (2020-12-12T09:37:42Z)
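The MAP formulation in the last entry, fusing low-resolution range data with high-resolution stereo, has a simple closed form when both measurements are modeled as independent per-pixel Gaussians: the posterior mean is the inverse-variance weighted average of the two depth estimates. A minimal sketch under that assumption (the paper's actual energy function and data-term properties are not modeled here):

```python
import numpy as np

def map_fuse(d_range, var_range, d_stereo, var_stereo):
    """Per-pixel MAP depth fusion under independent Gaussian noise.

    d_range, d_stereo : depth estimates (arrays of the same shape)
    var_range, var_stereo : their noise variances (scalars or arrays)
    The MAP estimate is the inverse-variance weighted average, so the
    more certain source dominates at each pixel.
    """
    w_range = 1.0 / var_range
    w_stereo = 1.0 / var_stereo
    return (w_range * d_range + w_stereo * d_stereo) / (w_range + w_stereo)
```

Equal variances reduce to a plain average; a much smaller stereo variance pulls the fused depth toward the stereo estimate, which is the behavior a range-stereo fusion scheme relies on.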
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all summaries) and is not responsible for any consequences of its use.