SARAH: Spatially Aware Real-time Agentic Humans
- URL: http://arxiv.org/abs/2602.18432v1
- Date: Fri, 20 Feb 2026 18:59:35 GMT
- Title: SARAH: Spatially Aware Real-time Agentic Humans
- Authors: Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard,
- Abstract summary: We present the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment.
- Score: 58.32612596034656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
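- Illustration (not from the paper): the abstract says eye contact intensity is adjusted at inference time through classifier-free guidance on a flow matching model. The sketch below shows only the standard classifier-free guidance blend applied to a velocity field and integrated with Euler steps. The names `guided_velocity`, `sample`, `toy_velocity`, `gaze_score`, and `guidance_scale` are hypothetical, the toy velocity function stands in for a trained network, and the paper's actual conditioning (user trajectory, audio) and guidance form may differ.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical interface: velocity(x, t, gaze_score) -> dx/dt.
# gaze_score=None means the unconditional branch (condition dropped).
VelocityFn = Callable[[Vector, float, Optional[float]], Vector]


def guided_velocity(velocity: VelocityFn, x: Vector, t: float,
                    gaze_score: float, guidance_scale: float) -> Vector:
    """Standard classifier-free guidance: v_uncond + w * (v_cond - v_uncond)."""
    v_uncond = velocity(x, t, None)
    v_cond = velocity(x, t, gaze_score)
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def sample(velocity: VelocityFn, x0: Vector, gaze_score: float,
           guidance_scale: float, num_steps: int = 10) -> Vector:
    """Integrate the guided velocity field from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = guided_velocity(velocity, x, t, gaze_score, guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, gaze_score: Optional[float]) -> Vector:
    """Assumed stand-in for a trained model: pulls x toward the gaze score
    when conditioned, and toward 0 when the condition is dropped."""
    target = 0.0 if gaze_score is None else gaze_score
    return [target - xi for xi in x]


if __name__ == "__main__":
    # A larger guidance scale pushes the sample further toward the condition.
    for w in (0.0, 1.0, 2.0):
        print(w, sample(toy_velocity, [0.0], gaze_score=1.0, guidance_scale=w))
```

  With a guidance scale of 0 the conditional branch is ignored, with 1 the conditional model is used as is, and larger values exaggerate the effect of the condition, which is the general mechanism by which a single scalar can act as an inference-time control.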
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
- Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z)
- MOSPA: Human Motion Generation Driven by Spatial Audio [83.31594478750682]
We introduce the first comprehensive Spatial Audio-Driven Human Motion dataset, which contains diverse and high-quality spatial audio and motion data. We develop a framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio. Our method achieves state-of-the-art performance on this task.
arXiv Detail & Related papers (2025-07-16T06:33:11Z)
- ARIG: Autoregressive Interactive Head Generation for Real-time Conversations [15.886402427095515]
Face-to-face communication, as a common human activity, motivates the research on interactive head generation. The previous clip-wise generation paradigm and explicit listener/speaker generator-switching methods are limited in how they acquire future signals. We propose an autoregressive (AR) frame-wise framework called ARIG to realize real-time generation with better interaction realism.
arXiv Detail & Related papers (2025-07-01T06:38:14Z)
- OT-Talk: Animating 3D Talking Head with Optimal Transportation [20.023346831300373]
OT-Talk is the first approach to leverage optimal transportation to optimize the learning model in talking head animation. Building on existing learning frameworks, we utilize a pre-trained Hubert model to extract audio features and a transformer model to process temporal sequences. Our experiments on two public audio-mesh datasets demonstrate that our method outperforms state-of-the-art techniques.
arXiv Detail & Related papers (2025-05-03T21:49:23Z)
- HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures [8.50717565369252]
HoloGest is a novel neural network framework for automatic generation of high-quality, expressive co-speech gestures. Our system learns a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. Our model achieves a level of realism close to the ground truth, providing an immersive user experience.
arXiv Detail & Related papers (2025-03-17T14:42:31Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior. Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- Synthesizing Diverse Human Motions in 3D Indoor Scenes [16.948649870341782]
We present a novel method for populating 3D indoor scenes with virtual humans that can navigate in the environment and interact with objects in a realistic manner.
Existing approaches rely on training sequences that contain captured human motions and the 3D scenes they interact with.
We propose a reinforcement learning-based approach that enables virtual humans to navigate in 3D scenes and interact with objects realistically and autonomously.
arXiv Detail & Related papers (2023-05-21T09:22:24Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.