Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans
- URL: http://arxiv.org/abs/2511.12662v1
- Date: Sun, 16 Nov 2025 15:52:18 GMT
- Title: Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans
- Authors: Hongbin Huang, Junwei Li, Tianxin Xie, Zhuang Li, Cekai Weng, Yaodong Yang, Yue Luo, Li Liu, Jing Tang, Zhijing Shao, Zeyu Wang
- Abstract summary: We present a high-fidelity, real-time conversational digital human system. It combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation.
- Score: 27.683599068167442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.
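To make the abstract's architecture concrete, here is a minimal sketch, in Python asyncio, of how an asynchronous conversational pipeline with wake-word gating, history augmentation, and intent-based routing might be wired together. Every name, the wake word, and the routing heuristic below are hypothetical illustrations, not the authors' implementation (the paper does not describe its code), and the sleeps are stand-ins for real ASR/LLM/TTS latency.

```python
import asyncio

# Illustrative stubs only; component names and logic are hypothetical,
# not the Hi-Reco authors' APIs.

async def detect_wake_word(mic: asyncio.Queue) -> str:
    """Block until an utterance containing the (assumed) wake word arrives."""
    while True:
        utterance = await mic.get()
        if utterance.lower().startswith("hey avatar"):  # assumed wake word
            return utterance

def route_intent(utterance: str) -> str:
    """Intent-based routing: send factual queries to retrieval, else chitchat."""
    return "knowledge_base" if "?" in utterance else "chitchat"

async def generate_response(utterance: str, history: list[str]) -> str:
    """Stand-in for knowledge-grounded LLM generation with history augmentation."""
    route = route_intent(utterance)
    context = " | ".join(history[-4:])  # augment the query with recent turns
    await asyncio.sleep(0.05)           # placeholder for model latency
    return f"[{route}] reply to {utterance!r} given {context!r}"

async def synthesize_speech(text: str) -> bytes:
    """Stand-in for persona-driven, emotionally expressive TTS."""
    await asyncio.sleep(0.05)
    return text.encode()

async def drive_avatar(audio: bytes) -> None:
    """Stand-in for lip-synced 3D avatar rendering."""
    print(f"avatar speaks {len(audio)} bytes of audio")

async def conversation_loop(mic: asyncio.Queue) -> None:
    history: list[str] = []
    utterance = await detect_wake_word(mic)
    while True:
        reply = await generate_response(utterance, history)
        # Asynchrony: overlap TTS for this turn with listening for the next,
        # instead of running the components strictly in series.
        tts = asyncio.create_task(synthesize_speech(reply))
        next_turn = asyncio.create_task(mic.get())
        await drive_avatar(await tts)
        history += [utterance, reply]
        utterance = await next_turn

async def main() -> None:
    mic: asyncio.Queue = asyncio.Queue()
    mic.put_nowait("hey avatar, who wrote Hamlet?")
    mic.put_nowait("thanks!")
    try:
        await asyncio.wait_for(conversation_loop(mic), timeout=1.0)
    except asyncio.TimeoutError:
        pass  # demo input exhausted

if __name__ == "__main__":
    asyncio.run(main())
```

The design point this sketch tries to capture from the abstract is the asynchronous execution pipeline: speech synthesis for the current turn overlaps with listening for the next one, so no single component serializes the whole interaction.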
Related papers
- The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era [95.35748535806744]
We launch the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026. This paper summarizes the dataset, track configurations, and the final results.
arXiv Detail & Related papers (2026-01-09T06:32:30Z) - TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation [72.46711449668814]
We introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction, and speech quality.
arXiv Detail & Related papers (2025-12-23T12:04:23Z) - Towards Interactive Intelligence for Digital Humans [31.977798807410682]
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. We present Mio, an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer.
arXiv Detail & Related papers (2025-12-15T18:57:35Z) - Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z) - MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation [23.343080324521434]
We introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. Our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources.
arXiv Detail & Related papers (2025-08-26T14:00:16Z) - Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z) - RITA: A Real-time Interactive Talking Avatars Framework [6.060251768347276]
RITA presents a high-quality real-time interactive framework built upon generative models.
Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions.
arXiv Detail & Related papers (2024-06-18T22:53:15Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in
Social Robots [0.040792653193642496]
This paper presents an initial implementation of a dialogue manager that enhances the traditional text-based prompts with real-time visual input.
The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency.
arXiv Detail & Related papers (2023-11-15T13:47:00Z) - Real-Time Gesture Recognition with Virtual Glove Markers [1.8352113484137629]
A real-time computer vision-based human-computer interaction tool for gesture recognition applications is proposed.
The system would be effective in real-time applications including social interaction through telepresence and rehabilitation.
arXiv Detail & Related papers (2022-07-06T14:56:08Z) - Enabling Harmonious Human-Machine Interaction with Visual-Context
Augmented Dialogue System: A Review [40.49926141538684]
Visual Context Augmented Dialogue System (VAD) has the potential to communicate with humans by perceiving and understanding multimodal information, and to generate engaging and context-aware responses.
arXiv Detail & Related papers (2022-07-02T09:31:37Z) - Retrieval Augmentation Reduces Hallucination in Conversation [49.35235945543833]
We explore the use of neural-retrieval-in-the-loop architectures for knowledge-grounded dialogue.
We show that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks.
arXiv Detail & Related papers (2021-04-15T16:24:43Z)