Prosodic Alignment for off-screen automatic dubbing
- URL: http://arxiv.org/abs/2204.02530v1
- Date: Wed, 6 Apr 2022 01:02:58 GMT
- Title: Prosodic Alignment for off-screen automatic dubbing
- Authors: Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote
- Abstract summary: The goal of automatic dubbing is to perform speech-to-speech translation while achieving audiovisual coherence.
This entails isochrony, i.e., translating the original speech by also matching its prosodic structure into phrases and pauses.
We extend the prosodic alignment model to address off-screen dubbing that requires less stringent synchronization constraints.
- Score: 17.7813193467431
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of automatic dubbing is to perform speech-to-speech translation
while achieving audiovisual coherence. This entails isochrony, i.e.,
translating the original speech by also matching its prosodic structure into
phrases and pauses, especially when the speaker's mouth is visible. In previous
work, we introduced a prosodic alignment model to address isochrone or
on-screen dubbing. In this work, we extend the prosodic alignment model to also
address off-screen dubbing that requires less stringent synchronization
constraints. We conduct experiments on four dubbing directions - English to
French, Italian, German and Spanish - on a publicly available collection of TED
Talks and on publicly available YouTube videos. Empirical results show that,
compared to our previous work, the extended prosodic alignment model provides a
significantly better subjective viewing experience on videos in which on-screen
and off-screen automatic dubbing is applied to sentences with the speaker's
mouth visible and not visible, respectively.
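The isochrony constraint described above can be made concrete with a small, self-contained sketch. The snippet below is illustrative only and is not the paper's learned prosodic alignment model: it brute-forces phrase breaks in a translated sentence so that estimated phrase durations track the source phrases, and a larger mismatch tolerance stands in for the relaxed off-screen synchronization constraints. The speaking-rate constant, function names, and example inputs are assumptions made for this sketch.
```python
# Illustrative sketch only: a simplified, brute-force prosodic alignment that
# splits a translated sentence into as many phrases as the source has, scoring
# candidate break points by how closely each target phrase's estimated speaking
# duration matches the corresponding source phrase. The tolerance parameter is
# a stand-in for relaxed off-screen synchronization; the paper's actual model
# is learned, not rule-based.
from itertools import combinations

SPEAKING_RATE_CPS = 15.0  # assumed characters-per-second estimate for the target voice


def estimated_duration(words):
    """Rough speech-duration estimate (seconds) from character count."""
    return sum(len(w) for w in words) / SPEAKING_RATE_CPS


def align_phrases(target_words, source_durations, on_screen=True):
    """Choose phrase breaks in the target so phrase durations track the source.

    source_durations: per-phrase durations (seconds) of the original speech.
    on_screen=False relaxes the mismatch penalty, mimicking off-screen dubbing.
    """
    n_phrases = len(source_durations)
    tolerance = 0.1 if on_screen else 0.4  # seconds of "free" mismatch per phrase
    best_split, best_cost = None, float("inf")
    # Enumerate all ways to place n_phrases - 1 breaks between words
    # (fine for short sentences; a real system would use dynamic programming).
    for breaks in combinations(range(1, len(target_words)), n_phrases - 1):
        bounds = (0, *breaks, len(target_words))
        phrases = [target_words[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(
            max(0.0, abs(estimated_duration(p) - d) - tolerance)
            for p, d in zip(phrases, source_durations)
        )
        if cost < best_cost:
            best_split, best_cost = phrases, cost
    return best_split, best_cost


if __name__ == "__main__":
    words = "la traduction doit respecter les pauses du locuteur original".split()
    print(align_phrases(words, source_durations=[1.2, 1.8], on_screen=False))
```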
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing [125.86266166482704]
We propose StyleDubber, which switches dubbing learning from the frame level to phoneme level.
It contains three main components: (1) A multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio, and generate intermediate representations informed by the facial emotion presented in the video; (2) An utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync.
arXiv Detail & Related papers (2024-02-20T01:28:34Z)
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
- Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing [6.26764826816895]
We investigate how humans perform the task of dubbing video content from one language into another.
We leverage a novel corpus of 319.57 hours of video from 54 professionally produced titles.
arXiv Detail & Related papers (2022-12-23T04:12:52Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture that tackles this problem via hierarchical prosody modelling, which bridges the visual information to the corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into the speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, its length/duration should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Neural Dubber: Dubbing for Silent Videos According to Scripts [22.814626504851752]
We propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task.
Neural Dubber is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech.
Experiments show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
arXiv Detail & Related papers (2021-10-15T17:56:07Z)
- Machine Translation Verbosity Control for Automatic Dubbing [11.85772502779967]
We propose new methods to control the verbosity of machine translation output.
For experiments we use a public data set to dub English speeches into French, Italian, German and Spanish.
We report extensive subjective tests that measure the impact of MT verbosity control on the final quality of dubbed video clips.
arXiv Detail & Related papers (2021-10-08T01:19:10Z)
- Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
- From Speech-to-Speech Translation to Automatic Dubbing [28.95595497865406]
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing.
Our architecture features neural machine translation that generates output of preferred length, prosodic alignment of the translation with the original speech segments, and neural text-to-speech with fine-tuning of the duration of each utterance; a rough, illustrative sketch of this duration-matching idea follows this list.
arXiv Detail & Related papers (2020-01-19T07:03:05Z)
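As referenced in the last entry above, the cascade idea of generating translations whose spoken duration fits the original segment can be illustrated with a small sketch. This is not code from any of the listed papers: it reranks hypothetical MT candidates by a crude character-based duration estimate and reports the speaking-rate adjustment a TTS stage would need to apply; the rate constant, function names, and example candidates are assumptions.
```python
# Illustrative sketch only: duration-aware candidate selection in the spirit of
# a cascade dubbing pipeline (MT -> prosodic alignment -> duration-tuned TTS).
# All constants and inputs below are assumed for the example.

ASSUMED_RATE_CPS = 15.0  # rough characters-per-second of natural TTS output


def estimated_speech_seconds(text: str) -> float:
    """Crude duration estimate from character count (placeholder for a TTS duration model)."""
    return len(text.replace(" ", "")) / ASSUMED_RATE_CPS


def pick_candidate(candidates: list[str], source_duration: float):
    """Return (best_candidate, required_rate_factor) for isochronous dubbing.

    required_rate_factor > 1 means the TTS must speak faster than natural;
    < 1 means it must slow down to fill the source segment.
    """
    scored = []
    for text in candidates:
        dur = estimated_speech_seconds(text)
        scored.append((abs(dur - source_duration), dur, text))
    _, best_dur, best_text = min(scored)
    return best_text, best_dur / source_duration


if __name__ == "__main__":
    hypotheses = [
        "Nous présentons un système de doublage automatique.",
        "Nous présentons ici un système complet de doublage entièrement automatique.",
    ]
    print(pick_candidate(hypotheses, source_duration=3.0))
```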
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.