ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
- URL: http://arxiv.org/abs/2503.14359v1
- Date: Tue, 18 Mar 2025 15:42:22 GMT
- Title: ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
- Authors: Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu
- Abstract summary: The next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. We introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, significantly enhancing the completeness, flexibility, and efficiency of data capture.
- Score: 34.450247091615395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60FPS, last 1-5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
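The capture specs above (5K at 60FPS, clips of 1-5 minutes) imply a substantial raw data rate per camera. The sketch below is a back-of-envelope estimate, not a figure from the paper: it assumes "5K" means 5120x2880 pixels and uncompressed 24-bit RGB, both of which are illustrative assumptions.

```python
# Back-of-envelope raw data rate for a single 5K/60FPS camera stream.
# Assumptions (not stated in the paper): "5K" = 5120 x 2880 pixels,
# uncompressed 24-bit RGB (3 bytes per pixel).
width, height = 5120, 2880
bytes_per_pixel = 3
fps = 60
clip_seconds = 5 * 60  # upper end of the 1-5 minute clip range

bytes_per_frame = width * height * bytes_per_pixel
bytes_per_second = bytes_per_frame * fps
bytes_per_clip = bytes_per_second * clip_seconds

print(f"{bytes_per_second / 1e9:.2f} GB/s per camera")  # 2.65 GB/s
print(f"{bytes_per_clip / 1e9:.0f} GB per 5-minute clip")  # 796 GB
```

Even before multiplying by the number of views in a multi-camera rig, this shows why efficient capture, synchronization, and compression matter for datasets of this kind.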
Related papers
- FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications.
Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings.
We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z) - AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task. Our framework incorporates two key components for video understanding and cross-modal learning. Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z) - Den-SOFT: Dense Space-Oriented Light Field DataseT for 6-DOF Immersive Experience [28.651514326042648]
We have built a custom mobile multi-camera large-space dense light field capture system.
Our aim is to contribute to the development of popular 3D scene reconstruction algorithms.
The collected dataset is much denser than existing datasets.
arXiv Detail & Related papers (2024-03-15T02:39:44Z) - EasyVolcap: Accelerating Neural Volumetric Video Research [69.59671164891725]
Volumetric video is a technology that digitally records dynamic events such as artistic performances, sporting events, and remote conversations.
EasyVolcap is a Python & Pytorch library for unifying the process of multi-view data processing, 4D scene reconstruction, and efficient dynamic volumetric video rendering.
arXiv Detail & Related papers (2023-12-11T17:59:46Z) - NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z) - DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions.
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z) - FSVVD: A Dataset of Full Scene Volumetric Video [2.9151420469958533]
In this paper, we focus on the current most widely used data format, point cloud, and for the first time release a full-scene volumetric video dataset.
Comprehensive dataset description and analysis are conducted, with potential usage of this dataset.
arXiv Detail & Related papers (2023-03-07T02:31:08Z) - DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras [63.186486240525554]
DeepMultiCap is a novel method for multi-person performance capture using sparse multi-view cameras.
Our method can capture time varying surface details without the need of using pre-scanned template models.
arXiv Detail & Related papers (2021-05-01T14:32:13Z) - Learning to compose 6-DoF omnidirectional videos using multi-sphere images [16.423725132964776]
We propose a system that uses a 3D ConvNet to generate a multi-sphere images representation that can be experienced in 6-DoF VR.
The system utilizes conventional omnidirectional VR camera footage directly without the need for a depth map or segmentation mask.
A ground truth generation approach for high-quality artifact-free 6-DoF contents is proposed and can be used by the research and development community.
arXiv Detail & Related papers (2021-03-10T03:09:55Z)
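The multi-sphere image (MSI) entry above represents a scene as concentric RGBA sphere layers, which a 6-DoF viewer renders by compositing the layers each ray passes through. The sketch below illustrates that rendering principle for a single ray; it is a generic illustration, not the paper's method, and it simplifies each sphere to one flat RGBA color in place of a full texture.

```python
import numpy as np

def ray_sphere_t(origin, direction, radius):
    """Distance along a unit-length ray from `origin` to a sphere of
    `radius` centered at the world origin (positive root; assumes the
    viewer sits inside every sphere)."""
    b = np.dot(origin, direction)
    disc = b * b - np.dot(origin, origin) + radius * radius
    return -b + np.sqrt(disc)

def composite_msi(origin, direction, spheres):
    """Front-to-back alpha compositing over concentric RGBA layers.
    `spheres` is a list of (radius, rgb, alpha) tuples, each layer a
    single flat color standing in for a textured sphere."""
    # Sort layers by hit distance so nearer layers occlude farther ones.
    hits = sorted(spheres, key=lambda s: ray_sphere_t(origin, direction, s[0]))
    color = np.zeros(3)
    transmittance = 1.0
    for radius, rgb, alpha in hits:
        color += transmittance * alpha * np.asarray(rgb, dtype=float)
        transmittance *= 1.0 - alpha
    return color

viewer = np.array([0.5, 0.0, 0.0])  # 6-DoF: viewer offset from capture center
ray = np.array([1.0, 0.0, 0.0])     # unit viewing direction
layers = [
    (1.0, (1, 0, 0), 0.5),  # inner layer: translucent red
    (2.0, (0, 1, 0), 0.5),  # middle layer: translucent green
    (3.0, (0, 0, 1), 1.0),  # outer layer: opaque blue background
]
print(composite_msi(viewer, ray, layers))  # [0.5  0.25 0.25]
```

Because the sphere radii are fixed, moving the viewer changes which texture location each ray samples, which is what gives MSI representations their limited but genuine 6-DoF parallax.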
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.