Related papers: AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

URL: http://arxiv.org/abs/2502.13133v1
Date: Tue, 18 Feb 2025 18:56:18 GMT
Title: AV-Flow: Transforming Text to Audio-Visual Human-like Interactions
Authors: Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollhoefer, Dimitris Samaras, Alexander Richard,
Abstract summary: AV-Flow is an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input.<n>We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose.
Score: 101.31009576033776
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose; all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus, synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In case of dyadic conversations, AV-Flow produces an always-on avatar, that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/

Related papers

Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis [44.503709089687014]
We propose THUNDER, a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. We show that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.
arXiv Detail & Related papers (2025-04-18T00:24:52Z)
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations [11.101103116878438]
We propose INFP, a novel audio-driven head generation framework for dyadic interaction.<n>INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage.<n>To facilitate this line of research, we introduce DyConv, a large scale dataset of rich dyadic conversations collected from the Internet.
arXiv Detail & Related papers (2024-12-05T10:20:34Z)
GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.<n>We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
State of the art in talking face generation focuses mainly on lip-syncing, being conditioned on audio clips. NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space. Model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication. We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technically achieved yet. We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations. We autoregressively output multiple possibilities of corresponding listener motion. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning [20.51814865676907]
It would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. We propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker. Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements.
arXiv Detail & Related papers (2021-12-06T02:53:51Z)
Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking. Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.