A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model
- URL: http://arxiv.org/abs/2602.04913v1
- Date: Wed, 04 Feb 2026 02:19:46 GMT
- Title: A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model
- Authors: Xiaolin Hu, Hang Yuan, Xinzhu Sang, Binbin Yan, Zhou Yu, Cong Huang, Kai Chen,
- Abstract summary: A$^2$-LLM is an end-to-end conversational audio avatar model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements beyond simple lip-synchronization.
- Score: 39.89874984616492
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial dynamics within a QA format. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements beyond simple lip-synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency, 0.7 RTF).
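For reference, real-time factor (RTF) is the standard ratio of processing time to the duration of audio handled; values below 1.0 mean generation keeps ahead of playback. Below is a minimal Python sketch of how the reported figures relate; the function and the worked numbers are illustrative, not from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock processing time / duration of audio produced.
    RTF < 1.0 means output is generated faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative numbers: at the reported RTF of 0.7, a 10 s utterance takes
# about 7 s of compute, so streaming stays ahead of playback once the
# initial 500 ms latency (time to first output) has elapsed.
rtf = real_time_factor(processing_seconds=7.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}; streaming is viable (RTF < 1.0): {rtf < 1.0}")
```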
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.
Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.
We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
- Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics [40.86039227407712]
We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation.
It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history (sketched after this entry).
Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set.
arXiv Detail & Related papers (2025-12-17T11:37:35Z)
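The turn-level causal attention described above admits a simple reading: tokens attend bidirectionally within their own turn and only to earlier turns across the conversation. The sketch below builds such a block mask; the token layout and mask convention are assumptions, not the authors' implementation.

```python
import numpy as np

def turn_level_causal_mask(turn_ids: list[int]) -> np.ndarray:
    """Boolean attention mask (True = may attend).

    Tokens attend bidirectionally within their own turn (fusing that
    turn's multimodal tokens) and causally to all earlier turns, so
    conversational history accumulates turn by turn.
    """
    ids = np.asarray(turn_ids)
    # query token i may attend to key token j iff j's turn <= i's turn
    return ids[None, :] <= ids[:, None]

# Three turns of 2, 3, and 2 tokens (e.g. interleaved audio/motion tokens)
mask = turn_level_causal_mask([0, 0, 1, 1, 1, 2, 2])
print(mask.astype(int))
```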
- Game-Time: Evaluating Temporal Dynamics in Spoken Language Models [93.844257719952]
We introduce the Game-Time Benchmark framework to assess the temporal capabilities of spoken language models (SLMs).
Our evaluation of diverse SLMs reveals a clear performance disparity.
The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI.
arXiv Detail & Related papers (2025-09-30T15:23:39Z)
- FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech interaction enables real-time spoken dialogue between humans and LLMs.
Modeling and benchmarking such systems remains a fundamental challenge.
We introduce FLEXI, the first benchmark for full-duplex human-LLM spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z)
- EAI-Avatar: Emotion-Aware Interactive Talking Head Generation [35.56554951482687]
We propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions.
Our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states.
arXiv Detail & Related papers (2025-08-25T13:07:03Z)
- Real-Time Textless Dialogue Generation [23.456302461693053]
We propose a real-time, textless spoken dialogue generation model (RTTL-DG).
Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly (a streaming sketch follows this entry).
Our model incorporates backchannels, fillers, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems.
arXiv Detail & Related papers (2025-01-08T23:21:43Z)
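Processing streaming conversation directly, as described above, implies consuming audio chunk by chunk and deciding at each step whether to emit speech, a backchannel, or nothing. Below is a minimal sketch under that reading; the chunk size, stub model, and interface are assumptions, not RTTL-DG's actual design.

```python
from typing import Iterator, Optional

class DummyTurnTaker:
    """Stand-in for the dialogue model; emits a backchannel every 5th
    chunk. Purely illustrative -- not RTTL-DG's real interface."""
    def __init__(self) -> None:
        self.n = 0
    def observe(self, chunk: bytes) -> None:
        self.n += 1                       # update dialogue state incrementally
    def step(self) -> Optional[str]:
        return "mm-hmm" if self.n % 5 == 0 else None

def streaming_dialogue(chunks: Iterator[bytes], model: DummyTurnTaker) -> Iterator[str]:
    """Consume partner audio chunk by chunk; emit speech units,
    backchannels, or laughter as soon as the model decides to,
    rather than waiting for a full utterance (low response delay)."""
    for chunk in chunks:
        model.observe(chunk)
        event = model.step()              # None means keep listening silently
        if event is not None:
            yield event

audio = (b"\x00" * 3200 for _ in range(12))   # placeholder chunks (silence)
print(list(streaming_dialogue(audio, DummyTurnTaker())))
```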
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences (a grouping sketch follows this entry).
We construct a multi-turn speech-to-speech dialogue dataset named method-500k, which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
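One simple way to shrink a speech-frame sequence to roughly text length, as the GroupFormer summary above describes, is to pool fixed-size groups of frame embeddings into single vectors. This is a minimal sketch; the group size and mean-pooling choice are assumptions rather than IntrinsicVoice's actual mechanism.

```python
import numpy as np

def group_speech_frames(frames: np.ndarray, group_size: int = 5) -> np.ndarray:
    """Collapse (T, D) speech-frame embeddings into (ceil(T/G), D) group
    embeddings by mean-pooling, so the LLM sees a sequence whose length
    is comparable to text rather than to raw audio frames."""
    T, D = frames.shape
    pad = (-T) % group_size                   # right-pad to a multiple of G
    padded = np.pad(frames, ((0, pad), (0, 0)))
    return padded.reshape(-1, group_size, D).mean(axis=1)

frames = np.random.randn(50, 256)             # 50 audio frames, dim 256
grouped = group_speech_frames(frames)         # -> (10, 256)
print(frames.shape, "->", grouped.shape)
```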
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models (the adapter pattern is sketched after this entry).
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
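The trainable-adapters-over-a-frozen-Latent-Diffusion-Model pattern mentioned above is a common parameter-efficient design: the backbone's weights stay fixed while small residual modules learn the new conditioning. Here is a PyTorch sketch of the general pattern; the module sizes and residual form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable bottleneck added to a frozen backbone layer.
    One such adapter each could carry identity, content, and emotion
    cues decoupled from audio (three adapters, as in the summary)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))  # residual update

backbone = nn.Linear(320, 320)        # stand-in for one frozen LDM block
for p in backbone.parameters():
    p.requires_grad = False           # freeze the diffusion model
adapter = Adapter(320)                # only these weights train

h = torch.randn(4, 320)
out = adapter(backbone(h))
print(sum(p.numel() for p in adapter.parameters()), "trainable params")
```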
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion (a sampling sketch follows this entry).
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
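The non-determinism above comes from sampling, rather than arg-maxing, the distribution over the next motion token at each autoregressive step, then decoding several independent rollouts. Below is a toy sketch; the vocabulary size, horizon, and stand-in model are illustrative assumptions (the paper itself works with quantized motion tokens).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_listener_rollouts(logits_fn, steps: int = 8, n_rollouts: int = 3,
                             vocab: int = 16) -> list[list[int]]:
    """Draw several independent autoregressive samples of discrete
    listener-motion tokens; sampling (not argmax) is what makes the
    predicted listener motion non-deterministic."""
    rollouts = []
    for _ in range(n_rollouts):
        seq: list[int] = []
        for _ in range(steps):
            p = np.exp(logits_fn(seq))        # toy next-token distribution
            p /= p.sum()
            seq.append(int(rng.choice(vocab, p=p)))
        rollouts.append(seq)
    return rollouts

# Toy stand-in model: random logits; a real model would condition on the
# speaker's audio/motion and on the tokens generated so far (seq).
toy_logits = lambda seq: rng.normal(size=16) * 0.5
for rollout in sample_listener_rollouts(toy_logits):
    print(rollout)
```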