Instruct-NeuralTalker: Editing Audio-Driven Talking Radiance Fields with Instructions
- URL: http://arxiv.org/abs/2306.10813v2
- Date: Wed, 16 Aug 2023 08:02:02 GMT
- Title: Instruct-NeuralTalker: Editing Audio-Driven Talking Radiance Fields with Instructions
- Authors: Yuqi Sun, Ruian He, Weimin Tan and Bo Yan
- Abstract summary: Recent neural talking radiance field methods have shown great success in audio-driven talking face synthesis.
We propose a novel interactive framework that utilizes human instructions to edit such implicit neural representations.
Our approach provides a significant improvement in rendering quality compared to state-of-the-art methods.
- Score: 16.45538217622068
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent neural talking radiance field methods have shown great success in photorealistic audio-driven talking face synthesis. In this paper, we propose a novel interactive framework that uses human instructions to edit such implicit neural representations and achieve real-time personalized talking face generation. Given a short speech video, we first build an efficient talking radiance field and then apply the latest conditional diffusion model for instruction-based image editing, guiding the optimization of the implicit representation towards the editing target. To preserve audio-lip synchronization during editing, we propose an iterative dataset updating strategy and a lip-edge loss that constrains changes in the lip region. We also introduce a lightweight refinement network that complements image details and enables controllable detail generation in the final rendered image. Our method supports real-time rendering at up to 30 FPS on consumer hardware. Multiple metrics and user studies show that our approach provides a significant improvement in rendering quality over state-of-the-art methods.
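The abstract names a lip-edge loss that constrains changes in the lip region during instruction-driven editing, but gives no formula. The snippet below is a minimal PyTorch-style sketch of one plausible formulation, assuming a Sobel edge map and a precomputed binary lip mask; the function name, mask source, and choice of edge detector are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def lip_edge_loss(rendered, target, lip_mask):
    """Hypothetical sketch of a lip-edge constraint (not the paper's code):
    penalize differences between edge maps of the edited render and the
    original frame inside the lip region, so instruction-driven edits do
    not disturb lip geometry and audio-lip sync is preserved.

    rendered, target: (B, 3, H, W) images in [0, 1]
    lip_mask:         (B, 1, H, W) binary mask of the lip region
    """
    # Sobel filters as a simple, differentiable edge detector (assumption).
    sobel_x = torch.tensor([[-1.0, 0.0, 1.0],
                            [-2.0, 0.0, 2.0],
                            [-1.0, 0.0, 1.0]])
    kernel = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)  # (2, 1, 3, 3)

    def edge_map(img):
        gray = img.mean(dim=1, keepdim=True)                # (B, 1, H, W)
        grads = F.conv2d(gray, kernel.to(img), padding=1)   # (B, 2, H, W)
        return grads.abs().sum(dim=1, keepdim=True)         # (B, 1, H, W)

    diff = (edge_map(rendered) - edge_map(target)).abs()
    # Average only over lip pixels; epsilon guards against an empty mask.
    return (diff * lip_mask).sum() / (lip_mask.sum() + 1e-6)
```

In a training loop, a term like this would typically be added with a small weight to the photometric loss that supervises the radiance field while the training images are iteratively replaced with diffusion-edited renders.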
Related papers
- Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing.
Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance.
We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks - the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z) - Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method that controls facial expression deformation based on the driving audio.
Our experiments show that our method surpasses state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z) - OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance [13.050998759819933]
"OpFlowTalker" is a novel approach that utilizes predicted optical flow changes from audio inputs rather than direct image predictions.
It smooths image transitions and aligns changes with semantic content.
We also developed an optical flow synchronization module that regulates both full-face and lip movements.
arXiv Detail & Related papers (2024-05-23T15:42:34Z) - Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z) - Pix2Video: Video Editing using Image Diffusion [43.07444438561277]
We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame.
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
arXiv Detail & Related papers (2023-03-22T16:36:10Z) - FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific masks.
Our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z) - Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers [91.00397473678088]
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
arXiv Detail & Related papers (2022-12-09T16:32:46Z) - VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild [37.93856291026653]
VideoReTalking is a new system to edit the faces of a real-world talking head video according to input audio.
It produces a high-quality, lip-synced output video even with a different emotion.
arXiv Detail & Related papers (2022-11-27T08:14:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.