From Speech-to-Speech Translation to Automatic Dubbing
- URL: http://arxiv.org/abs/2001.06785v3
- Date: Sun, 2 Feb 2020 21:54:13 GMT
- Title: From Speech-to-Speech Translation to Automatic Dubbing
- Authors: Marcello Federico, Robert Enyedi, Roberto Barra-Chicote, Ritwik Giri,
Umut Isik, Arvindh Krishnaswamy and Hassan Sawaf
- Abstract summary: We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing.
Our architecture features neural machine translation that generates output of a preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine-tuning of the duration of each utterance, and audio rendering that enriches the output with background noise and reverberation from the original audio.
- Score: 28.95595497865406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present enhancements to a speech-to-speech translation pipeline in order
to perform automatic dubbing. Our architecture features neural machine
translation generating output of preferred length, prosodic alignment of the
translation with the original speech segments, neural text-to-speech with fine
tuning of the duration of each utterance, and, finally, audio rendering to
enrich text-to-speech output with background noise and reverberation
extracted from the original audio. We report on a subjective evaluation of
automatic dubbing of excerpts of TED Talks from English into Italian, which
measures the perceived naturalness of automatic dubbing and the relative
importance of each proposed enhancement.
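The core constraint behind the pipeline described above is isochrony: each translated utterance must be synthesized to fit the time slot of the corresponding source segment. The following is a minimal, hypothetical sketch of that step, not the authors' implementation; the duration estimate (characters per second) and the allowed speaking-rate range are illustrative assumptions.

```python
# Hypothetical sketch of the isochrony constraint in automatic dubbing:
# a translated segment is stretched or compressed so it fills the time
# slot of the original speech segment. All rates here are illustrative.

def fit_to_slot(translation, slot_seconds, chars_per_second=15.0,
                min_rate=0.8, max_rate=1.2):
    """Return the playback-rate adjustment needed so the synthesized
    translation fills its source time slot, clamped to a natural range."""
    est_duration = len(translation) / chars_per_second  # naive TTS duration estimate
    rate = est_duration / slot_seconds  # >1.0 means the speech must speed up
    return max(min_rate, min(max_rate, rate))

# Two source segments with their original durations in seconds.
segments = [("Buongiorno a tutti", 1.5),
            ("Oggi parliamo di doppiaggio automatico", 2.0)]
rates = [fit_to_slot(text, slot) for text, slot in segments]
```

In a real system the duration estimate would come from the TTS model itself, and clamping the rate to a narrow range is what motivates generating translations of a preferred length in the first place: the closer the translation's natural duration is to the slot, the less the speech must be warped.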
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and the isochrony of the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated output.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, its length/duration should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
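The length-control idea behind such systems can be illustrated with a small sketch. This is a hypothetical re-ranking scheme, not the VideoDubber model: candidate translations are scored by the translation model and then penalized by how far their estimated speech duration deviates from the source duration. The per-token duration constant and the weighting factor are illustrative assumptions.

```python
# Illustrative sketch (not the actual VideoDubber system): re-rank candidate
# translations by how closely their estimated speech duration matches the
# duration of the original speech, on top of the translation model score.

def duration_penalty(candidate_tokens, source_seconds, seconds_per_token=0.35):
    """Relative mismatch between estimated and target speech duration."""
    est = len(candidate_tokens) * seconds_per_token
    return abs(est - source_seconds) / source_seconds

def rerank(candidates, source_seconds, alpha=1.0):
    """candidates: list of (tokens, model_score); higher score is better."""
    return sorted(
        candidates,
        key=lambda c: c[1] - alpha * duration_penalty(c[0], source_seconds),
        reverse=True,
    )

# A shorter candidate can win despite a lower model score
# if it fits the source time slot better.
cands = [(["hello", "everyone", "today"], 0.9), (["hi", "all"], 0.8)]
best = rerank(cands, source_seconds=0.7)[0]
```

Systems like the one summarized above go further by building duration awareness into decoding itself rather than re-ranking afterwards, but the trade-off being balanced is the same: translation quality against timing fit.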
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Machine Translation Verbosity Control for Automatic Dubbing [11.85772502779967]
We propose new methods to control the verbosity of machine translation output.
For experiments we use a public data set to dub English speeches into French, Italian, German and Spanish.
We report extensive subjective tests that measure the impact of MT verbosity control on the final quality of dubbed video clips.
arXiv Detail & Related papers (2021-10-08T01:19:10Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Text-to-speech for the hearing impaired [0.0]
Text-to-speech (TTS) systems can compensate for a hearing loss at the source rather than correct for it at the receiving end.
We propose an algorithm that restores loudness to normal perception at a high resolution in time, frequency and level.
arXiv Detail & Related papers (2020-12-03T18:52:03Z)
- SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation [12.292167129361825]
We propose autoencoding speaker conversion for training data augmentation in automatic speech translation.
This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice.
Our method compares favorably to SpecAugment on English-to-French and English-to-Romanian automatic speech translation (AST) tasks.
arXiv Detail & Related papers (2020-02-27T16:22:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.