VideoDubber: Machine Translation with Speech-Aware Length Control for
Video Dubbing
- URL: http://arxiv.org/abs/2211.16934v2
- Date: Tue, 5 Dec 2023 01:24:29 GMT
- Title: VideoDubber: Machine Translation with Speech-Aware Length Control for
Video Dubbing
- Authors: Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei
He, Sheng Zhao, Arul Menezes, Jiang Bian
- Abstract summary: Video dubbing aims to translate the original speech in a film or television program into the speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
- Score: 73.56970726406274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video dubbing aims to translate the original speech in a film or television
program into the speech in a target language, which can be achieved with a
cascaded system consisting of speech recognition, machine translation and
speech synthesis. To ensure that the translated speech is well aligned with the
corresponding video, the length/duration of the translated speech should be as
close as possible to that of the original speech, which requires strict length
control. Previous works usually control the number of words or characters
generated by the machine translation model to be similar to the source
sentence, without considering the isochronicity of speech as the speech
duration of words/characters in different languages varies. In this paper, we
propose a machine translation system tailored for the task of video dubbing,
which directly considers the speech duration of each token in translation, to
match the length of source and target speech. Specifically, we control the
speech length of the generated sentence by guiding the prediction of each word with
the duration information, including the speech duration of itself as well as
how much duration is left for the remaining words. We design experiments on
four language directions (German -> English, Spanish -> English, Chinese <->
English), and the results show that the proposed method achieves better length
control ability on the generated speech than baseline methods. To make up for the
lack of real-world datasets, we also construct a real-world test set collected
from films to provide comprehensive evaluations on the video dubbing task.
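
The duration guidance described in the abstract (each word is predicted with its own speech duration and the duration left for the remaining words) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `duration_features` and `length_ratio`, the use of integer frame counts as the duration unit, and the example values are all assumptions made for illustration.

```python
from typing import List, Tuple


def duration_features(token_durations: List[int], budget: int) -> List[Tuple[int, int]]:
    """Return (own duration, duration left afterwards) for each token.

    token_durations: hypothetical per-token speech durations, in frames.
    budget: hypothetical total duration of the source speech, in frames.
    """
    features = []
    remaining = budget
    for duration in token_durations:
        remaining -= duration  # budget left for the words that follow
        features.append((duration, remaining))
    return features


def length_ratio(token_durations: List[int], budget: int) -> float:
    """Ratio of generated speech length to source speech length (1.0 is a perfect match)."""
    return sum(token_durations) / budget


if __name__ == "__main__":
    # Three target tokens lasting 3, 5 and 2 frames against a 12-frame source.
    print(duration_features([3, 5, 2], 12))
    print(length_ratio([3, 5, 2], 12))
```

A negative remaining value in this sketch would signal that the translation has already overrun the source duration, which is the situation length control is meant to avoid.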
Related papers
- Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages [33.5772006275197]
End-to-end speech translation (ST) translates source language speech directly into target language text.
Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio.
We present improvements to the duration alignment component of our sequence-to-sequence ST model.
(2024-11-11T21:39:21Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
(2024-05-28T04:11:37Z)
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
(2023-12-23T08:45:57Z)
- Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters [71.02335065794384]
We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences.
We show that our model improves translation quality and isochrony compared to previous work.
(2023-05-22T16:36:04Z)
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
(2023-02-25T04:23:25Z)
- Large-scale multilingual audio visual dubbing [31.43873011591989]
We describe a system for large-scale audiovisual translation and dubbing.
The source language's speech content is transcribed to text, translated, and automatically synthesized into target language speech.
The visual content is translated by synthesizing lip movements for the speaker to match the translated audio.
(2020-11-06T18:58:15Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
(2020-10-28T12:33:04Z)
- From Speech-to-Speech Translation to Automatic Dubbing [28.95595497865406]
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing.
Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, and neural text-to-speech with fine-tuning of the duration of each utterance.
(2020-01-19T07:03:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.