MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction
- URL: http://arxiv.org/abs/2512.11301v1
- Date: Fri, 12 Dec 2025 05:54:19 GMT
- Title: MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction
- Authors: Bate Li, Houqiang Zhong, Zhengxue Cheng, Qiang Hu, Qiang Wang, Li Song, Wenjun Zhang
- Abstract summary: We present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Experimental validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications.
- Score: 23.428989479526336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric-view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline that achieve sub-millisecond temporal synchronization across views, together with accurate pose annotations. Experimental validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research.
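The synchronization the abstract describes is done in hardware, but the downstream alignment step it enables is easy to illustrate. The sketch below is a minimal, hypothetical example (not the authors' pipeline): given per-frame capture timestamps referenced to a shared clock, it pairs each frame of a reference view with the nearest-in-time frame of another view and drops pairs outside a tolerance. The function name `align_views` and the 1 ms tolerance are assumptions for illustration.

```python
import numpy as np

def align_views(ref_ts: np.ndarray, other_ts: np.ndarray,
                tol_s: float = 1e-3) -> list[tuple[int, int]]:
    """Pair frames of two views by nearest capture timestamp.

    ref_ts, other_ts: per-frame timestamps in seconds, referenced to a
    shared clock. Pairs farther apart than tol_s seconds are dropped.
    """
    pairs = []
    for i, t in enumerate(ref_ts):
        j = int(np.argmin(np.abs(other_ts - t)))  # nearest frame in time
        if abs(other_ts[j] - t) <= tol_s:
            pairs.append((i, j))
    return pairs

# Toy usage: two 30 fps streams whose clocks agree to ~0.2 ms.
ref = np.arange(0.0, 1.0, 1 / 30)
other = ref + 2e-4
assert len(align_views(ref, other)) == len(ref)
```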
Related papers
- MV-TAP: Tracking Any Point in Multi-View Videos [34.91357343992975]
MV-TAP is a novel point tracker that follows points across multi-view videos of dynamic scenes by leveraging cross-view information. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. (A sketch of the underlying cross-view geometry follows this entry.)
arXiv Detail & Related papers (2025-12-01T18:59:01Z)
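Cross-view point tracking ultimately rests on multi-view geometry: 2D tracks of the same point in calibrated views pin down its 3D trajectory. The snippet below is a hedged illustration of that geometric core only, not MV-TAP's learned architecture; it triangulates one point from two views with the standard direct linear transform (DLT), and the 3x4 projection matrices are assumed inputs.

```python
import numpy as np

def triangulate_dlt(P1: np.ndarray, P2: np.ndarray,
                    x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Linear (DLT) triangulation of one point from two calibrated views.

    P1, P2: 3x4 camera projection matrices; x1, x2: 2D pixel observations
    of the same point. Returns the 3D point in world coordinates.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)   # null vector of A is the solution
    X = vt[-1]
    return X[:3] / X[3]           # dehomogenize
```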
- Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views [5.723697351415207]
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions.
arXiv Detail & Related papers (2025-10-26T13:27:59Z)
- PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first realistic egocentric world simulator. It generates egocentric videos that are strictly aligned with the user's real-scene human motion, as captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z)
- EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research. (A generic masked-modeling sketch follows this entry.)
arXiv Detail & Related papers (2025-06-09T15:59:25Z)
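Masked modeling over token streams is compact enough to sketch. The PyTorch snippet below is a generic, assumed rendition of the idea rather than EgoM2P's actual model: a random subset of token embeddings is swapped for a learned mask token, a small transformer encodes the corrupted sequence, and the loss reconstructs the masked positions.

```python
import torch
import torch.nn as nn

class MaskedTokenModel(nn.Module):
    """Toy masked-modeling objective over a stream of token embeddings."""

    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, dim)  # reconstruct masked embeddings

    def forward(self, tokens: torch.Tensor, mask_ratio: float = 0.5):
        # tokens: (batch, seq, dim) pre-tokenized multimodal embeddings
        masked = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        x = torch.where(masked[..., None], self.mask_token, tokens)
        pred = self.head(self.encoder(x))
        # reconstruction loss on masked positions only
        return ((pred - tokens) ** 2)[masked].mean()

model = MaskedTokenModel()
loss = model(torch.randn(2, 16, 256))
loss.backward()
```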
- ImViD: Immersive Volumetric Videos for Enhanced VR Engagement [34.450247091615395]
The next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. We introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, significantly enhancing the completeness, flexibility, and efficiency of data capture.
arXiv Detail & Related papers (2025-03-18T15:42:22Z)
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in both dynamic-object and background reconstruction quality over the state of the art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our module can be seamlessly integrated into existing models with minimal overhead, bringing consistent performance enhancements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- 3D Human Pose Perception from Egocentric Stereo Videos [67.9563319914377]
We propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation.
Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting.
We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.
arXiv Detail & Related papers (2023-12-30T21:21:54Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to viewpoint by aligning egocentric and exocentric videos in time (a textbook sequence-alignment sketch follows this entry).
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
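Temporal alignment of ego and exo sequences can be made concrete with classic dynamic time warping over per-frame embeddings. The sketch below is that textbook baseline under assumed inputs, not AE2's self-supervised objective: it computes the cumulative warping cost between two embedding sequences, where a lower cost indicates better temporal alignment between the views.

```python
import numpy as np

def dtw_cost(ego: np.ndarray, exo: np.ndarray) -> float:
    """Dynamic-time-warping cost between two embedding sequences.

    ego: (T1, d) and exo: (T2, d) L2-normalized per-frame embeddings.
    Lower cost means the two views are better aligned in time.
    """
    cost = 1.0 - ego @ exo.T            # cosine distance matrix (T1, T2)
    T1, T2 = cost.shape
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):      # standard DTW recurrence
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[T1, T2])
```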