Iterative Text-based Editing of Talking-heads Using Neural Retargeting
- URL: http://arxiv.org/abs/2011.10688v1
- Date: Sat, 21 Nov 2020 01:05:55 GMT
- Title: Iterative Text-based Editing of Talking-heads Using Neural Retargeting
- Authors: Xinwei Yao, Ohad Fried, Kayvon Fatahalian, Maneesh Agrawala
- Abstract summary: We present a text-based tool for editing talking-head video that enables an iterative editing workflow.
On each iteration, users can edit the wording of the speech, further refine mouth motions if necessary to reduce artifacts, and manipulate non-verbal aspects of the performance.
Our tool requires only 2-3 minutes of target actor video and synthesizes the video for each iteration in about 40 seconds.
- Score: 42.964779538134714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a text-based tool for editing talking-head video that enables an
iterative editing workflow. On each iteration, users can edit the wording of the
speech, further refine mouth motions if necessary to reduce artifacts, and
manipulate non-verbal aspects of the performance by inserting mouth gestures
(e.g. a smile) or changing the overall performance style (e.g. energetic,
mumble). Our tool requires only 2-3 minutes of target actor video and
synthesizes the video for each iteration in about 40 seconds, allowing users to
quickly explore many editing possibilities as they iterate. Our approach is
based on two key ideas. (1) We develop a fast phoneme search algorithm that can
quickly identify phoneme-level subsequences of the source repository video that
best match a desired edit (sketched below). This enables our fast iteration loop. (2) We
leverage a large repository of video of a source actor and develop a new
self-supervised neural retargeting technique for transferring the mouth motions
of the source actor to the target actor. This allows us to work with relatively
short target actor videos, making our approach applicable in many real-world
editing scenarios. Finally, our refinement and performance controls give users
the ability to further fine-tune the synthesized results.
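To make idea (1) concrete, here is a minimal Python sketch of phoneme-level subsequence matching, under the simplifying assumption that match quality is plain edit distance over phoneme symbols. It illustrates the general idea only; the paper's actual search algorithm, cost model, and indexing are not reproduced here, and all phoneme symbols are made-up placeholders.
```python
# Minimal sketch of phoneme-level subsequence search: score every
# repository window against the edit's phoneme sequence by edit distance
# and keep the lowest-cost candidates. Illustrative only, not the
# paper's algorithm.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (pa != pb)))   # substitution
        prev = curr
    return prev[-1]

def best_matches(edit_phonemes, repo_phonemes, k=5):
    """Return the k (cost, start, length) repository windows that best
    match the edit, allowing a small slack in window length."""
    n = len(edit_phonemes)
    candidates = []
    for length in range(max(1, n - 2), n + 3):
        for start in range(len(repo_phonemes) - length + 1):
            window = repo_phonemes[start:start + length]
            candidates.append((edit_distance(edit_phonemes, window),
                               start, length))
    return sorted(candidates)[:k]

# Toy usage with made-up ARPAbet-style symbols.
repo = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
edit = ["W", "ER", "D"]
print(best_matches(edit, repo, k=3))
```
A real system would index the repository rather than scan it linearly, and would likely use phoneme- or viseme-aware substitution costs instead of a uniform penalty.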
Related papers
- Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions [49.14827857853878]
ReimaginedAct comprises video understanding, reasoning, and editing modules.
Our method can accept not only direct instructional text prompts but also 'what if' questions to predict possible action changes.
arXiv Detail & Related papers (2024-03-11T22:46:46Z)
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models [19.792535444735957]
RAVE is a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training.
It produces high-quality videos while preserving original motion and semantic structure.
RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations.
arXiv Detail & Related papers (2023-12-07T18:43:45Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- Instruct-NeuralTalker: Editing Audio-Driven Talking Radiance Fields with Instructions [16.45538217622068]
Recent neural talking radiance field methods have shown great success in audio-driven talking face synthesis.
We propose a novel interactive framework that utilizes human instructions to edit such implicit neural representations.
Our approach provides a significant improvement in rendering quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-06-19T10:03:11Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder (a toy sketch of this two-stage pipeline follows this entry).
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
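As a rough illustration of the two-stage design the SVTS summary describes, the sketch below composes a stand-in video-to-spectrogram predictor with a stand-in vocoder. Every module, shape, and name here is a hypothetical placeholder, not the paper's model.
```python
# Illustrative two-stage video-to-speech pipeline: stage 1 maps mouth-crop
# frames to mel-spectrogram frames, stage 2 turns the spectrogram into
# audio. Both modules are hypothetical stand-ins, not the SVTS models.
import torch
import torch.nn as nn

class VideoToSpectrogram(nn.Module):
    """Placeholder predictor: frames (B, T, C, H, W) -> mel frames (B, T, 80)."""
    def __init__(self, mel_dim=80):
        super().__init__()
        self.frame_enc = nn.Sequential(nn.Flatten(start_dim=2),
                                       nn.LazyLinear(256), nn.ReLU())
        self.temporal = nn.GRU(256, 256, batch_first=True)
        self.head = nn.Linear(256, mel_dim)

    def forward(self, frames):
        x = self.frame_enc(frames)     # per-frame features
        h, _ = self.temporal(x)        # temporal context across frames
        return self.head(h)            # mel-spectrogram frames

def synthesize(frames, predictor, vocoder):
    mel = predictor(frames)            # stage 1: video -> spectrogram
    return vocoder(mel)                # stage 2: spectrogram -> waveform

frames = torch.randn(2, 50, 1, 48, 48)            # toy grayscale mouth crops
vocoder = lambda mel: torch.tanh(mel).flatten(1)  # stand-in for a neural vocoder
audio = synthesize(frames, VideoToSpectrogram(), vocoder)
print(audio.shape)
```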
- CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing [67.96138567288197]
This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet).
The model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region from the surrounding speech context (a toy sketch of this masking idea follows this entry).
It can correct unnatural prosody in the edited region and synthesize speech for words in the edited transcript that do not appear in the original recording.
arXiv Detail & Related papers (2022-02-21T02:05:14Z)
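Below is a toy sketch of the masked-prediction idea the CampNet summary describes, under the assumption that speech is represented as mel-spectrogram frames: mask one contiguous span and train a context model to reconstruct it. The architecture, shapes, and loss are placeholders, not CampNet itself.
```python
# Toy masked-prediction training step: zero out one contiguous span of
# speech frames and train a context model to reconstruct it. The features,
# model, and loss here are illustrative placeholders, not CampNet.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_random_span(features, mask_frac=0.15):
    """Zero one contiguous span of frames; return masked copy and span."""
    t = features.size(1)                          # (batch, time, dim)
    span = max(1, int(t * mask_frac))
    start = torch.randint(0, t - span + 1, (1,)).item()
    masked = features.clone()
    masked[:, start:start + span, :] = 0.0
    return masked, (start, start + span)

class ContextPredictor(nn.Module):
    """Bidirectional GRU that predicts every frame from the masked input."""
    def __init__(self, dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, dim)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)

model = ContextPredictor()
feats = torch.randn(4, 100, 80)                   # toy mel-spectrogram frames
masked, (s, e) = mask_random_span(feats)
pred = model(masked)
loss = F.l1_loss(pred[:, s:e], feats[:, s:e])     # penalize only the masked span
loss.backward()
print(loss.item())
```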
- Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor [44.36920938661454]
This paper proposes a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities.
Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively.
Our evaluations show a clear improvement in the efficiency of human editors and in the quality of the generated videos.
arXiv Detail & Related papers (2021-10-16T14:19:12Z)
- Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles (a generic retrieval sketch follows this entry).
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
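One way to picture the retrieval step in Transcript-to-Video is to embed a transcript sentence and all candidate shots in a shared space and rank shots by similarity. The sketch below is a generic illustration under that assumption, with random stand-in encoders rather than the paper's learned visual-language representations.
```python
# Generic text-to-shot retrieval sketch: embed the query sentence and all
# candidate shots in a shared space, then rank shots by cosine similarity.
# The encoders are random stand-ins, not learned representations.
import numpy as np

def embed_text(sentence, dim=64):
    """Placeholder text encoder: a hash-seeded random projection per string."""
    local = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return local.standard_normal(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_shots(sentence, shot_embeddings, k=3):
    """Return indices of the k shots most similar to the sentence."""
    q = embed_text(sentence)
    scores = [cosine(q, s) for s in shot_embeddings]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(0)
shots = [rng.standard_normal(64) for _ in range(100)]  # toy shot embeddings
print(rank_shots("a chef plates the dish", shots))
```
For real-time sequencing, one would precompute shot embeddings and use an approximate nearest-neighbor index instead of this linear scan.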
- Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z)