Related papers: VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

URL: http://arxiv.org/abs/2211.16934v2
Date: Tue, 5 Dec 2023 01:24:29 GMT
Title: VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing
Authors: Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei He, Sheng Zhao, Arul Menezes, Jiang Bian
Abstract summary: Video dubbing aims to translate the original speech in a film or television program into speech in a target language. To ensure that the translated speech is well aligned with the corresponding video, its length/duration should be as close as possible to that of the original speech. We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
Score: 73.56970726406274
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to that of the source sentence, without considering the isochronicity of speech, as the speech duration of words/characters varies across languages. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including the speech duration of the word itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control ability on the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.
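The duration information described in the abstract (each word's own speech duration plus the duration left for the remaining words) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of integer frame counts as the duration unit, and the choice to report the budget available before each token is emitted are all assumptions made for illustration.

```python
from typing import List, Tuple


def duration_features(token_durations: List[int], total_budget: int) -> List[Tuple[int, int]]:
    """Pair each target token's duration with the budget still available.

    token_durations: hypothetical per-token speech durations, in frames.
    total_budget: hypothetical duration of the source speech, in frames.
    Returns one (own_duration, remaining_budget) pair per token, where
    remaining_budget is what is left before the token is emitted
    (an assumed convention, not taken from the paper).
    """
    features = []
    remaining = total_budget
    for duration in token_durations:
        features.append((duration, remaining))
        remaining -= duration  # spend this token's duration from the budget
    return features


def leftover_budget(token_durations: List[int], total_budget: int) -> int:
    """Budget left after all tokens; negative means the target speech overruns."""
    return total_budget - sum(token_durations)


if __name__ == "__main__":
    durations = [3, 5, 2]  # illustrative values only
    budget = 12
    print(duration_features(durations, budget))
    print(leftover_budget(durations, budget))
```

In a real system, pairs like these would presumably be fed to the translation model as additional conditioning at each decoding step; the sketch only shows the bookkeeping.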

Related papers

Length Aware Speech Translation for Video Dubbing [27.946422755130868]
We develop a phoneme-based end-to-end length-sensitive speech translation model, which generates translations of varying lengths: short, normal, and long. We also introduce length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass.
arXiv Detail & Related papers (2025-05-31T23:01:50Z)
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing [15.134076873312809]
A cross-lingual dubbing system translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. We propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model.
arXiv Detail & Related papers (2025-05-27T08:43:28Z)
Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
SpeechSSM learns from and samples long-form spoken audio in a single decoding session without text intermediates. The work also introduces new embedding-based and LLM-judged metrics, quality measurements over length and time, and a new benchmark for long-form speech processing and generation, LibriSpeech-Long.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages [33.5772006275197]
End-to-end speech translation (ST) translates source language speech directly into target language text. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio. We present improvements to the duration alignment component of our sequence-to-sequence ST model.
arXiv Detail & Related papers (2024-11-11T21:39:21Z)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascade fashion. We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors. We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters [71.02335065794384]
We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences. We show that our model improves translation quality and isochrony compared to previous work.
arXiv Detail & Related papers (2023-05-22T16:36:04Z)
Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations. We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
Large-scale multilingual audio visual dubbing [31.43873011591989]
We describe a system for large-scale audiovisual translation and dubbing. The source language's speech content is transcribed to text, translated, and automatically synthesized into target language speech. The visual content is translated by synthesizing lip movements for the speaker to match the translated audio.
arXiv Detail & Related papers (2020-11-06T18:58:15Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model, which aims to improve end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
From Speech-to-Speech Translation to Automatic Dubbing [28.95595497865406]
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, and neural text-to-speech with fine-tuning of the duration of each utterance.
arXiv Detail & Related papers (2020-01-19T07:03:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.