Fugu-MT Paper Translation (Abstract): Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

Paper overview: Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

arxiv url: http://arxiv.org/abs/2604.09473v1
Date: Fri, 10 Apr 2026 16:31:12 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.962237
Title: Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement
Title (reference translation): Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement
Authors: Zhengxian Yang, Shengqi Wang, Shi Pan, Hongshuai Li, Haoxiang Wang, Lin Li, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu
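Abstract summary: Immersive Volumetric Videos is a new volumetric media format designed to provide large 6-DoF interaction spaces. We present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data.
Reference score (independently calculated attention score): 27.56981802996559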
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos (IVV), a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground-background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
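Abstract (reference translation): Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. Such experiences can be achieved with computer-generated content, but constructing them directly from real-world captured video remains largely unexplored. Immersive Volumetric Videos (IVV) is a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built on a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video and audio acquisition during motion, which makes it efficient to capture complex indoor and outdoor scenes with rich foreground-background interactions and challenging dynamics. The dataset provides 5K-resolution video at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Using this dataset, we develop a dynamic light field reconstruction framework based on a Gaussian-based spatio-temporal representation; it incorporates flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments show that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.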

Related papers