EPIC Fields: Marrying 3D Geometry and Video Understanding
- URL: http://arxiv.org/abs/2306.08731v2
- Date: Thu, 1 Feb 2024 09:59:34 GMT
- Title: EPIC Fields: Marrying 3D Geometry and Video Understanding
- Authors: Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro
Laina, Diane Larlus, Dima Damen, Andrea Vedaldi
- Abstract summary: EPIC Fields is an augmentation of EPIC-KITCHENS with 3D camera information.
It removes the complex and expensive step of reconstructing cameras using photogrammetry.
It reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.
- Score: 76.60638761589065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural rendering is fuelling a unification of learning, 3D geometry and video
understanding that has been waiting for more than two decades. Progress,
however, is still hampered by a lack of suitable datasets and benchmarks. To
address this gap, we introduce EPIC Fields, an augmentation of EPIC-KITCHENS
with 3D camera information. Like other datasets for neural rendering, EPIC
Fields removes the complex and expensive step of reconstructing cameras using
photogrammetry, and allows researchers to focus on modelling problems. We
illustrate the challenges of photogrammetry in egocentric videos of dynamic
actions and propose innovations to address them. Compared to other neural
rendering datasets, EPIC Fields is better tailored to video understanding
because it is paired with labelled action segments and the recent VISOR segment
annotations. To further motivate the community, we also evaluate two benchmark
tasks in neural rendering and segmenting dynamic objects, with strong baselines
that showcase what is not possible today. We also highlight the advantage of
geometry in semi-supervised video object segmentation on the VISOR
annotations. EPIC Fields reconstructs 96% of videos in EPIC-KITCHENS,
registering 19M frames in 99 hours recorded in 45 kitchens.
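To make concrete what pairing video with per-frame 3D camera information enables, the sketch below applies the standard pinhole model to project a reconstructed 3D point into a frame's pixel coordinates, the kind of operation that links scene geometry to VISOR masks or rendered views. It is only an illustration under assumed values, not the EPIC Fields loader: the function `project_point`, the intrinsics, and the pose are hypothetical, and reading the dataset's actual camera files is not shown.

```python
import numpy as np

def project_point(K, R, t, X_world):
    """Project a 3D world point into pixel coordinates for one frame."""
    X_cam = R @ X_world + t        # world -> camera coordinates
    if X_cam[2] <= 0:
        return None                # point lies behind the camera
    uv = K @ (X_cam / X_cam[2])    # perspective division, then intrinsics
    return uv[:2]                  # (u, v) in pixels

# Hypothetical intrinsics, identity pose, and a toy point (not dataset values):
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(project_point(K, R, t, np.array([0.1, -0.05, 1.5])))
```

The same camera model, inverted with depth, can lift 2D annotations into 3D; this is one plausible way geometry can support the semi-supervised segmentation task highlighted in the abstract.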
Related papers
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z)
- EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations [83.26326325568208]
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video.
Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions.
VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality.
arXiv Detail & Related papers (2022-09-26T23:03:26Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural feature renderer.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views; a generic back-projection sketch, under assumed intrinsics, follows this list.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
- NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
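The "Learning Dynamic View Synthesis With Few RGBD Cameras" entry above builds on back-projecting RGBD frames into point clouds before neural rendering. The sketch below shows only that generic unprojection step, assuming metric depth and known intrinsics; `rgbd_to_pointcloud` and the toy inputs are hypothetical and do not reproduce the paper's renderer or its Regional Depth-Inpainting module.

```python
import numpy as np

def rgbd_to_pointcloud(depth, rgb, K, cam_to_world):
    """Back-project an RGBD frame into a coloured point cloud in world coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grids, shape (H, W)
    z = depth.reshape(-1)
    valid = z > 0                                          # drop missing depth values
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]            # pinhole back-projection
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)[valid]
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_world = (cam_to_world @ pts_h.T).T[:, :3]          # move into world coordinates
    return pts_world, rgb.reshape(-1, 3)[valid]

# Hypothetical toy inputs (not values from the paper):
K = np.array([[500.0, 0.0, 64.0],
              [0.0, 500.0, 48.0],
              [0.0, 0.0, 1.0]])
depth = np.full((96, 128), 2.0)                            # constant 2 m depth
rgb = np.zeros((96, 128, 3), dtype=np.uint8)
points, colours = rgbd_to_pointcloud(depth, rgb, K, np.eye(4))
print(points.shape, colours.shape)
```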
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.