VoluMe -- Authentic 3D Video Calls from Live Gaussian Splat Prediction
- URL: http://arxiv.org/abs/2507.21311v1
- Date: Mon, 28 Jul 2025 20:07:55 GMT
- Title: VoluMe -- Authentic 3D Video Calls from Live Gaussian Splat Prediction
- Authors: Martin de La Gorce, Charlie Hewitt, Tibor Takacs, Robert Gerdisch, Zafiirah Hosenie, Givi Meishvili, Marek Kowalski, Thomas J. Cashman, Antonio Criminisi
- Abstract summary: We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods.
- Score: 9.570954192915005
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Virtual 3D meetings offer the potential to enhance copresence, increase engagement, and thus improve the effectiveness of remote meetings compared to standard 2D video calls. However, representing people in 3D meetings remains a challenge; existing solutions achieve high quality by using complex hardware, making use of fixed appearance via enrolment, or by inverting a pre-trained generative model. These approaches lead to constraints that are unwelcome and ill-fitting for videoconferencing applications. We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed, where the 3D representation is not only live and realistic, but also authentic to the input video. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint (a property we call authenticity), while generalizing realistically to novel viewpoints. Additionally, we introduce a stability loss to obtain reconstructions that are temporally stable on video sequences. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods, and demonstrate our approach in live one-to-one 3D meetings using only a standard 2D camera and display. This demonstrates that our approach can allow anyone to communicate volumetrically, via a method for 3D videoconferencing that is not only highly accessible, but also realistic and authentic.
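The abstract gives only the high-level recipe: condition a 3D Gaussian reconstruction on each webcam frame independently (authenticity), and add a stability loss so consecutive predictions do not jitter. A minimal PyTorch sketch of that recipe follows; the `GaussianPredictor` network, the 14-parameter splat layout, and the plain MSE stability term are all illustrative assumptions, not VoluMe's actual architecture or loss.

```python
# Minimal sketch: per-frame Gaussian prediction plus a stability loss.
# GaussianPredictor is hypothetical; VoluMe's real architecture, splat
# parameterization, and loss weighting are not given in the abstract.
import torch
import torch.nn.functional as F

class GaussianPredictor(torch.nn.Module):
    """Maps one RGB frame to N Gaussian splats (assumed layout below)."""
    def __init__(self, num_gaussians: int = 4096, feat_dim: int = 14):
        super().__init__()
        # 14 values per splat: 3 position, 4 rotation (quaternion),
        # 3 scale, 3 colour, 1 opacity -- a common 3DGS parameterization.
        self.num_gaussians, self.feat_dim = num_gaussians, feat_dim
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 64, 4, stride=4), torch.nn.ReLU(),
            torch.nn.Conv2d(64, 128, 4, stride=4), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(128, num_gaussians * feat_dim),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) -> (B, N, 14) Gaussian parameters
        out = self.backbone(frame)
        return out.view(-1, self.num_gaussians, self.feat_dim)

def stability_loss(g_t: torch.Tensor, g_prev: torch.Tensor) -> torch.Tensor:
    """Penalize frame-to-frame jitter between consecutive predictions."""
    return F.mse_loss(g_t, g_prev)

# Each frame is reconstructed independently (the "authenticity" property);
# only the stability term couples consecutive frames during training.
model = GaussianPredictor()
f0, f1 = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
loss = stability_loss(model(f1), model(f0).detach())
```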
Related papers
- GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain they operate in, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent 'local reconstruction and rendering' paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z)
- Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion [22.185551913099598]
Single-image 3D portrait reconstruction has enabled telepresence systems to stream 3D portrait videos from a single camera in real time. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. We propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views.
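As a rough illustration of the fusion idea, the sketch below blends a canonical triplane (built once from a reference view) with a per-frame triplane; the fixed convex combination and the `alpha` weight are stand-ins for Coherent3D's learned fusion module.

```python
# Illustrative triplane fusion: a learned fusion network would replace
# this fixed convex combination of canonical and per-frame features.
import torch

def fuse_triplanes(canonical: torch.Tensor,
                   per_frame: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    # canonical, per_frame: (3, C, H, W) feature maps, one per plane
    return alpha * canonical + (1.0 - alpha) * per_frame

canonical = torch.rand(3, 32, 64, 64)  # stable prior from the reference view
per_frame = torch.rand(3, 32, 64, 64)  # dynamic appearance, current frame
fused = fuse_triplanes(canonical, per_frame)
```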
arXiv Detail & Related papers (2024-12-11T18:57:24Z)
- Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
- StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos [44.51044100125421]
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences.
Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays.
arXiv Detail & Related papers (2024-09-11T17:52:07Z)
- ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model [16.14713604672497]
ReconX is a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by this condition, the video diffusion model then synthesizes video frames that preserve detail and exhibit a high degree of 3D consistency.
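The three-stage pipeline this summary describes can be laid out structurally as below; every function is a hypothetical stand-in (random tensors replace real structure-from-motion, encoding, and diffusion models), showing only how the 3D structure condition threads through the stages.

```python
# Structural sketch of the described pipeline: sparse views -> global
# point cloud -> encoded 3D condition -> conditioned video diffusion.
# All three stages are stand-ins, not ReconX's actual components.
import torch

def build_global_point_cloud(views: torch.Tensor) -> torch.Tensor:
    # Stand-in: a real system would triangulate points from the views.
    return torch.rand(4096, 3)

def encode_structure_condition(points: torch.Tensor) -> torch.Tensor:
    # Stand-in: embed the point cloud into a contextual token space.
    return points.mean(dim=0, keepdim=True).repeat(77, 1)  # (tokens, dim)

def sample_video_diffusion(cond: torch.Tensor, num_frames: int = 16):
    # Stand-in: a real diffusion model would denoise frames under `cond`.
    return torch.rand(num_frames, 3, 256, 256)

views = torch.rand(2, 3, 256, 256)  # sparse input views
cond = encode_structure_condition(build_global_point_cloud(views))
frames = sample_video_diffusion(cond)  # (T, C, H, W) video frames
```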
arXiv Detail & Related papers (2024-08-29T17:59:40Z)
- Coherent 3D Portrait Video Reconstruction via Triplane Fusion [21.381482393260406]
Per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance.
We propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information.
Our method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.
arXiv Detail & Related papers (2024-05-01T18:08:51Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Appearance-Preserving 3D Convolution for Video-based Person Re-identification [61.677153482995564]
We propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel.
It is easy to combine AP3D with existing 3D ConvNets by simply replacing the original 3D convolution kernels with AP3Ds.
arXiv Detail & Related papers (2020-07-16T16:21:34Z)
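The AP3D abstract describes a drop-in replacement for standard 3D convolutions, which suggests a traversal like the one below; the `AP3D` wrapper is a placeholder that omits the real Appearance-Preserving Module, so only the kernel-swapping pattern reflects the described usage.

```python
# Sketch of swapping every nn.Conv3d in an existing 3D ConvNet for an
# AP3D-style block. The AP3D class is a placeholder: the real module
# would align appearance across frames before the 3D convolution.
import torch.nn as nn

class AP3D(nn.Module):
    """Placeholder appearance-preserving wrapper around a 3D conv."""
    def __init__(self, conv3d: nn.Conv3d):
        super().__init__()
        self.conv = conv3d  # real APM alignment would happen before this

    def forward(self, x):
        return self.conv(x)

def replace_conv3d(model: nn.Module) -> nn.Module:
    """Recursively replace each nn.Conv3d with an AP3D wrapper."""
    for name, child in model.named_children():
        if isinstance(child, nn.Conv3d):
            setattr(model, name, AP3D(child))
        else:
            replace_conv3d(child)
    return model

backbone = nn.Sequential(nn.Conv3d(3, 16, 3), nn.ReLU(), nn.Conv3d(16, 32, 3))
backbone = replace_conv3d(backbone)  # both Conv3d layers are now wrapped
```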