Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth
Estimation in Dynamic Scenes
- URL: http://arxiv.org/abs/2304.08993v1
- Date: Tue, 18 Apr 2023 13:55:24 GMT
- Title: Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth
Estimation in Dynamic Scenes
- Authors: Rui Li, Dong Gong, Wei Yin, Hao Chen, Yu Zhu, Kaixuan Wang, Xiaozhi
Chen, Jinqiu Sun, Yanning Zhang
- Abstract summary: We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing the heuristically crafted masks.
Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method.
- Score: 51.20150148066458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-frame depth estimation generally achieves high accuracy relying on the
multi-view geometric consistency. When applied in dynamic scenes, e.g.,
autonomous driving, this consistency is usually violated in the dynamic areas,
leading to corrupted estimations. Many multi-frame methods handle dynamic areas
by identifying them with explicit masks and compensating the multi-view cues
with monocular cues represented as local monocular depth or features. The
improvements are limited due to the uncontrolled quality of the masks and the
underutilized benefits of the fusion of the two types of cues. In this paper,
we propose a novel method to learn to fuse the multi-view and monocular cues
encoded as volumes without needing the heuristically crafted masks. As unveiled
in our analyses, the multi-view cues capture more accurate geometric
information in static areas, and the monocular cues capture more useful
contexts in dynamic areas. To let the geometric perception learned from
multi-view cues in static areas propagate to the monocular representation in
dynamic areas and let monocular cues enhance the representation of multi-view
cost volume, we propose a cross-cue fusion (CCF) module, which includes the
cross-cue attention (CCA) to encode the spatially non-local relative
intra-relations from each source to enhance the representation of the other.
Experiments on real-world datasets prove the significant effectiveness and
generalization ability of the proposed method.
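To make the cross-cue idea concrete, the sketch below implements a generic non-local cross-attention in which the spatial intra-relations (an HW x HW affinity map) computed from one cue re-weight the features of the other cue, applied symmetrically in both directions. This is a minimal PyTorch illustration under our own assumptions about tensor shapes, layer choices, and module names (CrossCueAttention, CrossCueFusion); it is a sketch of the general technique, not the authors' released implementation of CCA/CCF.

```python
import torch
import torch.nn as nn


class CrossCueAttention(nn.Module):
    """Sketch: intra-relations computed from one cue (the 'source')
    re-weight the volume features of the other cue (the 'target')."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, source, target):
        # source, target: (B, C, H, W) volume features from the two cues
        b, c, h, w = source.shape
        q = self.query(source).flatten(2).transpose(1, 2)  # (B, HW, C)
        k = self.key(source).flatten(2)                     # (B, C, HW)
        # Non-local intra-relations of the source cue: (B, HW, HW)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)
        v = self.value(target).flatten(2).transpose(1, 2)   # (B, HW, C)
        fused = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # Residual connection: target cue enhanced by source-cue relations
        return target + self.out(fused)


class CrossCueFusion(nn.Module):
    """Symmetric fusion: each cue's relations enhance the other's representation."""

    def __init__(self, channels):
        super().__init__()
        self.mono_to_multi = CrossCueAttention(channels)
        self.multi_to_mono = CrossCueAttention(channels)

    def forward(self, mono_feat, multi_feat):
        enhanced_multi = self.mono_to_multi(mono_feat, multi_feat)
        enhanced_mono = self.multi_to_mono(multi_feat, mono_feat)
        return torch.cat([enhanced_mono, enhanced_multi], dim=1)
```

In use, one would feed the monocular feature volume and the multi-view cost-volume features at matched spatial resolution and regress depth from the fused output; no explicit dynamic-region mask is required, since the attention map decides where each cue contributes.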
Related papers
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior.
We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information.
We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged, targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
We model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos [14.965321452764355]
We introduce a new approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can render static-only scene views.
Our approach combines the principles of local radiance fields with the bidirectional optimization of omnidirectional rays.
Our experiments validate that OmniLocalRF outperforms existing methods in both qualitative and quantitative metrics.
arXiv Detail & Related papers (2024-03-31T12:55:05Z)
- GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation [7.158264965010546]
We propose an efficient component for cue fusion in monocular multi-frame depth estimation.
We represent scene attributes in the form of super tokens without relying on precise shapes.
Our method achieves state-of-the-art performance on the KITTI dataset with efficient fusion speed.
arXiv Detail & Related papers (2023-09-29T08:43:16Z)
- Multi-Spectral Image Stitching via Spatial Graph Reasoning [52.27796682972484]
We propose a spatial graph reasoning based multi-spectral image stitching method.
We embed multi-scale complementary features from the same view position into a set of nodes.
By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features.
arXiv Detail & Related papers (2023-07-31T15:04:52Z)
- Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings.
We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z)
- Attentive Multi-View Deep Subspace Clustering Net [4.3386084277869505]
We propose a novel Attentive Multi-View Deep Subspace Nets (AMVDSN) model.
Our proposed method seeks to find a joint latent representation that explicitly considers both consensus and view-specific information.
The experimental results on seven real-world data sets have demonstrated the effectiveness of our proposed algorithm against some state-of-the-art subspace learning approaches.
arXiv Detail & Related papers (2021-12-23T12:57:26Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues [24.743099160992937]
We propose a novel self-supervised joint learning framework for depth estimation.
The proposed framework outperforms the state-of-the-art (SOTA) on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2020-06-17T13:56:59Z)