Monocular Dynamic View Synthesis: A Reality Check
- URL: http://arxiv.org/abs/2210.13445v1
- Date: Mon, 24 Oct 2022 17:58:28 GMT
- Title: Monocular Dynamic View Synthesis: A Reality Check
- Authors: Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, Angjoo Kanazawa
- Abstract summary: We show a discrepancy between the practical capture process and the existing experimental protocols.
We define effective multi-view factors (EMFs) to quantify the amount of multi-view signal present in the input capture sequence.
We also propose a new iPhone dataset that includes more diverse real-life deformation sequences.
- Score: 45.438135525140154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the recent progress on dynamic view synthesis (DVS) from monocular
video. Though existing approaches have demonstrated impressive results, we show
a discrepancy between the practical capture process and the existing
experimental protocols, which effectively leak multi-view signals during
training. We define effective multi-view factors (EMFs) to quantify the amount
of multi-view signal present in the input capture sequence based on the
relative camera-scene motion. We introduce two new metrics: co-visibility
masked image metrics and correspondence accuracy, which overcome the issue in
existing protocols. We also propose a new iPhone dataset that includes more
diverse real-life deformation sequences. Using our proposed experimental
protocol, we show that the state-of-the-art approaches observe a 1-2 dB drop in
masked PSNR in the absence of multi-view cues and a 4-5 dB drop when modeling
complex motion. Code and data can be found at https://hangg7.com/dycheck.
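As a rough illustration of the EMF idea, the sketch below estimates an angular EMF as the camera's mean angular velocity about a fixed look-at point. The function name, the fixed-look-at assumption, and the degrees-per-second convention are illustrative choices, not the paper's exact definition; the authors' implementation lives at https://hangg7.com/dycheck.

```python
import numpy as np

def angular_emf(cam_positions, lookat, fps):
    """Sketch of an angular effective multi-view factor (EMF).

    Approximates the EMF as the camera's mean angular velocity
    (degrees/second) about an assumed fixed look-at point -- a proxy
    for how much multi-view signal the capture sweeps out over time.

    cam_positions: (N, 3) camera centers, one per frame.
    lookat:        (3,) approximate scene center (assumed static).
    fps:           capture frame rate.
    """
    rays = cam_positions - lookat                        # scene-to-camera vectors
    rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    cos = np.clip((rays[:-1] * rays[1:]).sum(-1), -1.0, 1.0)
    step_deg = np.degrees(np.arccos(cos))                # angle swept between frames
    return step_deg.mean() * fps                         # mean degrees per second
```

Under this proxy, a rapidly orbiting or "teleporting" camera yields a large value, while a near-stationary handheld capture of a moving subject yields a small one, which is the regime the paper argues reflects practical monocular capture.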
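In the same hedged spirit, the two proposed metrics can be sketched as (i) a PSNR restricted to pixels that a co-visibility mask marks as observed by enough training frames, and (ii) a PCK-style fraction of correspondences landing within a pixel threshold. How the mask and correspondences are actually constructed is defined in the paper; the names and signatures below are assumptions for illustration.

```python
import numpy as np

def masked_psnr(pred, gt, covis_mask, max_val=1.0):
    """Co-visibility masked PSNR (illustrative sketch).

    Scores only pixels the (hypothetical) boolean `covis_mask` marks
    as co-visible in the training views, so regions that no training
    frame ever observed do not penalize the rendering.

    pred, gt:   (H, W, 3) images in [0, max_val].
    covis_mask: (H, W) boolean, True where the pixel is co-visible.
    """
    err2 = ((pred - gt) ** 2).mean(axis=-1)     # per-pixel squared error, (H, W)
    mse = err2[covis_mask].mean()               # average over co-visible pixels only
    return 10.0 * np.log10(max_val ** 2 / mse)

def correspondence_accuracy(pred_pts, gt_pts, thresh_px):
    """PCK-style correspondence accuracy (illustrative sketch).

    Fraction of predicted 2D correspondences that land within
    `thresh_px` pixels of their ground-truth locations.
    """
    err = np.linalg.norm(pred_pts - gt_pts, axis=-1)    # (K,) pixel errors
    return float((err < thresh_px).mean())
```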
Related papers
- Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting [44.48514301889318]
This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach.
A large synthetic dataset is adopted to enhance the model's generalization ability.
Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance.
arXiv Detail & Related papers (2024-05-30T11:03:27Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple yet effective lightweight feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- DVANet: Disentangling View and Action Features for Multi-View Action Recognition [56.283944756315066]
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z)
- Efficient View Synthesis and 3D-based Multi-Frame Denoising with Multiplane Feature Representations [1.18885605647513]
We introduce the first 3D-based multi-frame denoising method that significantly outperforms its 2D-based counterparts with lower computational requirements.
Our method extends the multiplane image (MPI) framework for novel view synthesis by introducing a learnable encoder-renderer pair manipulating multiplane representations in feature space.
arXiv Detail & Related papers (2023-03-31T15:23:35Z)
- MoCo-Flow: Neural Motion Consensus Flow for Dynamic Humans in Stationary Monocular Cameras [98.40768911788854]
We introduce MoCo-Flow, a representation that models the dynamic scene using a 4D continuous time-variant function.
At the heart of our work lies a novel optimization formulation, which is constrained by a motion consensus regularization on the motion flow.
We extensively evaluate MoCo-Flow on several datasets that contain human motions of varying complexity.
arXiv Detail & Related papers (2021-06-08T16:03:50Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning [114.58986229852489]
In this paper, we explore basic and generic supervision signals within a sequence from spatial, sequential and temporal perspectives.
We derive a particular form named Sequence Contrastive Learning (SeCo).
SeCo shows superior results under the linear protocol on action recognition, untrimmed activity recognition and object tracking.
arXiv Detail & Related papers (2020-08-03T15:51:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.