ChatAnything: Facetime Chat with LLM-Enhanced Personas
- URL: http://arxiv.org/abs/2311.06772v1
- Date: Sun, 12 Nov 2023 08:29:41 GMT
- Title: ChatAnything: Facetime Chat with LLM-Enhanced Personas
- Authors: Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi
Feng, Daquan Zhou
- Abstract summary: We propose the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation.
For MoV, we utilize text-to-speech (TTS) algorithms with a variety of pre-defined tones.
For MoD, we combine recent popular text-to-image generation techniques with talking-head algorithms to streamline the process of generating talking objects.
- Score: 87.76804680223003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this technical report, we target generating anthropomorphized personas for
LLM-based characters in an online manner, including visual appearance,
personality and tones, with only text descriptions. To achieve this, we first
leverage the in-context learning capability of LLMs for personality generation
by carefully designing a set of system prompts. We then propose two novel
concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for
diverse voice and appearance generation. For MoV, we utilize text-to-speech (TTS)
algorithms with a variety of pre-defined tones and automatically select the
best-matching one based on the user-provided text description. For MoD, we combine
recent popular text-to-image generation techniques with talking-head algorithms to
streamline the process of generating talking
objects. We term the whole framework ChatAnything. With it, users can animate
anything with an anthropomorphic persona using just a few text inputs. However, we
have observed that the anthropomorphic objects produced by current generative models
often go undetected by pre-trained face landmark detectors, causing face motion
generation to fail even when the faces look human-like, because such images are
rarely seen during the detectors' training (i.e., they are out-of-distribution (OOD)
samples). To address this issue, we
incorporate pixel-level guidance to infuse human face landmarks during the
image generation phase. To benchmark this, we have built an evaluation dataset.
Based on it, we verify that the face landmark detection rate increases significantly,
from 57.0% to 92.5%, thus allowing automatic face animation driven by the generated
speech content. The code and more results can be
found at https://chatanything.github.io/.
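The abstract describes the persona, voice, appearance, and evaluation steps at a high level, so a few hedged sketches follow. None is the authors' code, and every name, path, and parameter introduced below is a hypothetical stand-in. First, personality generation via in-context learning: a system prompt for staying in character might be assembled roughly like this.

```python
# Hedged sketch of persona generation via an LLM system prompt (wording is
# illustrative, not the paper's actual prompt set). The system prompt asks the LLM
# to infer a personality from the user's text description and stay in character.
PERSONA_SYSTEM_PROMPT = """You are role-playing the object described below as a living,
speaking character. Infer a fitting personality (temperament, speaking style, quirks)
from the description, answer in that character's voice, and never break character.

Object description: {description}
"""

def build_messages(description: str, user_utterance: str) -> list[dict]:
    """Assemble a chat request: persona system prompt plus the user's message."""
    return [
        {"role": "system", "content": PERSONA_SYSTEM_PROMPT.format(description=description)},
        {"role": "user", "content": user_utterance},
    ]

messages = build_messages(
    "an ancient, wise oak tree that speaks slowly and kindly",
    "Good morning! How did you sleep?",
)
print(messages[0]["content"])
```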
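Next, the MoV step: the paper picks the best-matching pre-defined TTS tone from the user's text description. The sketch below substitutes sentence-embedding similarity for the LLM-based matching; the voice pool, descriptions, and embedding model are assumptions.

```python
# Minimal sketch of MoV-style voice selection (hypothetical names, not the paper's code).
# A sentence-embedding model scores each pre-defined tone description against the
# user-provided persona description, and the closest voice is picked.
from sentence_transformers import SentenceTransformer, util

# Hypothetical pool of pre-defined TTS tones with short descriptions of how they sound.
VOICE_POOL = {
    "warm_female":  "a warm, gentle, friendly female voice",
    "deep_male":    "a deep, calm, authoritative male voice",
    "bright_child": "a bright, energetic, playful child-like voice",
    "robotic":      "a flat, mechanical, robotic voice",
}

_model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose text encoder

def select_voice(persona_description: str) -> str:
    """Return the key of the pre-defined tone that best matches the description."""
    names = list(VOICE_POOL)
    desc_emb = _model.encode(persona_description, convert_to_tensor=True)
    pool_emb = _model.encode([VOICE_POOL[n] for n in names], convert_to_tensor=True)
    scores = util.cos_sim(desc_emb, pool_emb)[0]  # cosine similarity per candidate voice
    return names[int(scores.argmax())]

if __name__ == "__main__":
    print(select_voice("an ancient, wise oak tree that speaks slowly and kindly"))
```

In the paper the matching is driven by the LLM itself via prompting; the cosine-similarity ranking here is only a compact, deterministic substitute.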
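For MoD and the pixel-level landmark guidance, one plausible approximation with off-the-shelf tools is to steer a Stable Diffusion image-to-image pass with a template human face, so the generated object keeps a landmark-friendly facial layout. This is a rough sketch of that idea, not the paper's method; the checkpoint id, template image, and strength value are assumptions.

```python
# Hedged sketch: bias text-to-image generation toward a face-landmark-friendly layout
# by starting from a template human face (img2img), as a rough stand-in for the
# pixel-level landmark guidance described in the abstract. Paths and parameters are
# illustrative, not the paper's settings.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical template image of a frontal human face with clearly detectable landmarks.
template = Image.open("face_template.png").convert("RGB").resize((512, 512))

prompt = "an anthropomorphic oak tree with a friendly human-like face, portrait, detailed"
result = pipe(
    prompt=prompt,
    image=template,   # the template's facial layout constrains the output at the pixel level
    strength=0.6,     # lower strength keeps more of the template's face structure
    guidance_scale=7.5,
).images[0]
result.save("persona.png")
```

The generated image would then be handed to a talking-head model for animation, which is outside the scope of this sketch.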
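Finally, the reported landmark detection rate (57.0% rising to 92.5%) is simply the fraction of generated images on which a pre-trained detector succeeds. A minimal way to measure it is sketched below, using dlib's frontal face detector as a stand-in for whichever landmark detector the paper evaluates and a hypothetical output folder.

```python
# Minimal sketch of measuring the face (landmark) detection rate over generated images.
# dlib's pre-trained HOG face detector stands in for the paper's landmark detector;
# "generated_personas/" is a hypothetical output folder.
import glob
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def has_detectable_face(path: str) -> bool:
    """True if the detector finds at least one face in the image."""
    img = cv2.imread(path)
    if img is None:
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return len(detector(gray, 1)) > 0

paths = sorted(glob.glob("generated_personas/*.png"))
hits = sum(has_detectable_face(p) for p in paths)
print(f"detection rate: {hits / max(len(paths), 1):.1%}")
```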
Related papers
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing, conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Emotionally Enhanced Talking Face Generation [52.07451348895041]
We build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions.
We show that our model can adapt to arbitrary identities, emotions, and languages.
Our proposed framework is equipped with a user-friendly web interface with a real-time experience for talking face generation with emotions.
arXiv Detail & Related papers (2023-03-21T02:33:27Z)
- FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method.
It is very robust to background noise and can handle audio recorded in a variety of situations.
It produces superior results with respect to the realism of the animation 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z)
- Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech [33.01930038988336]
We introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework, called Face-TTS.
We jointly train cross-modal biometrics and TTS models to preserve speaker identity between face images and generated speech segments.
Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers.
arXiv Detail & Related papers (2023-02-27T11:59:28Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.