Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization
- URL: http://arxiv.org/abs/2601.07518v1
- Date: Mon, 12 Jan 2026 13:17:41 GMT
- Title: Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization
- Authors: Fangyu Lin, Yingdong Hu, Zhening Liu, Yufan Zhuang, Zehong Lin, Jun Zhang
- Abstract summary: Mon3tr is a novel monocular 3D telepresence framework that integrates 3D Gaussian splatting (3DGS) based parametric human modeling. A single monocular RGB camera captures body motions and facial expressions in real time to drive the 3DGS-based parametric human model. The method achieves a PSNR of > 28 dB for novel poses, an end-to-end latency of ~ 80 ms, and > 1000x bandwidth reduction compared to point-cloud streaming.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Immersive telepresence aims to transform human interaction in AR/VR applications by enabling lifelike full-body holographic representations for enhanced remote collaboration. However, existing systems rely on hardware-intensive multi-camera setups and demand high bandwidth for volumetric streaming, limiting their real-time performance on mobile devices. To overcome these challenges, we propose Mon3tr, a novel Monocular 3D telepresence framework that integrates 3D Gaussian splatting (3DGS) based parametric human modeling into telepresence for the first time. Mon3tr adopts an amortized computation strategy, dividing the process into a one-time offline multi-view reconstruction phase to build a user-specific avatar and a monocular online inference phase during live telepresence sessions. A single monocular RGB camera is used to capture body motions and facial expressions in real time to drive the 3DGS-based parametric human model, significantly reducing system complexity and cost. The extracted motion and appearance features are transmitted at < 0.2 Mbps over WebRTC's data channel, allowing robust adaptation to network fluctuations. On the receiver side, e.g., Meta Quest 3, we develop a lightweight 3DGS attribute deformation network to dynamically generate corrective 3DGS attribute adjustments on the pre-built avatar, synthesizing photorealistic motion and appearance at ~ 60 FPS. Extensive experiments demonstrate the state-of-the-art performance of our method, achieving a PSNR of > 28 dB for novel poses, an end-to-end latency of ~ 80 ms, and > 1000x bandwidth reduction compared to point-cloud streaming, while supporting real-time operation from monocular inputs across diverse scenarios. Our demos can be found at https://mon3tr3d.github.io.
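The sub-0.2 Mbps figure is plausible from a back-of-envelope calculation: streaming a compact per-frame vector of pose and expression parameters is orders of magnitude cheaper than streaming geometry. The sketch below packs a hypothetical feature layout (the parameter counts, 6D rotation encoding, and float16 quantization are illustrative assumptions, not details from the paper) and estimates the resulting bitrate:

```python
import struct

# Hypothetical per-frame feature layout (assumed for illustration):
# 24 body-joint rotations in 6D representation, 50 facial expression
# coefficients, and a 3D root translation.
BODY = 24 * 6
EXPR = 50
ROOT = 3
FLOATS = BODY + EXPR + ROOT  # 197 values per frame

def pack_frame(features):
    """Serialize one frame of driving features as little-endian float16."""
    assert len(features) == FLOATS
    return struct.pack(f"<{FLOATS}e", *features)  # 'e' = IEEE 754 half

payload = pack_frame([0.0] * FLOATS)
bytes_per_frame = len(payload)                   # 197 * 2 = 394 bytes
bitrate_mbps = bytes_per_frame * 8 * 60 / 1e6    # at 60 fps
print(bytes_per_frame, round(bitrate_mbps, 3))   # 394 0.189
```

Even at 60 fps such a payload stays under 0.2 Mbps before any entropy coding, which is why transmitting driving signals instead of volumetric data can yield the reported > 1000x reduction over point-cloud streaming.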
Related papers
- Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera [54.967647497048205]
We present Stereo-Inertial Poser, a real-time motion capture system that estimates metric-accurate and shape-aware 3D human motion. We replace the monocular RGB camera with stereo vision, enabling direct 3D keypoint extraction and body shape parameter estimation. Our method produces drift-free global translation over long recordings and reduces foot-skating artifacts.
arXiv Detail & Related papers (2026-03-02T17:46:38Z) - Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z) - M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.48046909056468]
We reformulate talking-head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors [25.67875816218477]
Full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. Previous works either require additional sensors worn on the pelvis and lower body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists.
arXiv Detail & Related papers (2025-05-08T15:28:09Z) - EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis [61.1662426227688]
Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner.
arXiv Detail & Related papers (2025-03-26T02:47:27Z) - TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting [4.011241510647248]
We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. We show that TaoAvatar achieves state-of-the-art rendering quality while running in real time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
arXiv Detail & Related papers (2025-03-21T10:40:37Z) - SwinGS: Sliding Window Gaussian Splatting for Volumetric Video Streaming with Arbitrary Length [2.4844080708094745]
This paper introduces SwinGS, a novel framework for training, delivering, and rendering volumetric video in a real-time streaming fashion. We implement a prototype of SwinGS and demonstrate its streamability across various datasets and scenes. We also develop an interactive WebGL viewer enabling real-time volumetric video playback on most devices with modern browsers.
arXiv Detail & Related papers (2024-09-12T05:33:15Z) - EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration. An online, real-time, fine-grained, and highly generalizable 3D perception model is urgently needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z) - WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections [8.261637198675151]
Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics.
We propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections.
Our approach outperforms existing approaches on the rendering quality of novel view and appearance synthesis, with fast convergence and high rendering speed.
arXiv Detail & Related papers (2024-06-04T15:17:37Z) - InstantSplat: Sparse-view Gaussian Splatting in Seconds [91.77050739918037]
We introduce InstantSplat, a novel approach for addressing sparse-view 3D scene reconstruction at lightning-fast speed. InstantSplat employs a self-supervised framework that optimizes the 3D scene representation and camera poses. It achieves an acceleration of over 30x in reconstruction and improves visual quality (SSIM) from 0.3755 to 0.7624 compared to traditional SfM with 3D-GS.
arXiv Detail & Related papers (2024-03-29T17:29:58Z) - Multi-view data capture for dynamic object reconstruction using handheld augmented reality mobiles [0.0]
We propose a system to capture nearly-synchronous frame streams from multiple and moving handheld mobiles.
Each mobile executes Simultaneous Localisation and Mapping on-board to estimate its pose, and uses a wireless communication channel to send or receive synchronisation triggers.
We show the effectiveness of our system by employing it for 3D skeleton and volumetric reconstructions.
arXiv Detail & Related papers (2021-03-14T10:26:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.