Jointly Optimizing Translations and Speech Timing to Improve Isochrony
in Automatic Dubbing
- URL: http://arxiv.org/abs/2302.12979v1
- Date: Sat, 25 Feb 2023 04:23:25 GMT
- Title: Jointly Optimizing Translations and Speech Timing to Improve Isochrony
in Automatic Dubbing
- Authors: Alexandra Chronopoulou, Brian Thompson, Prashant Mathur, Yogesh
Virkar, Surafel M. Lakew, Marcello Federico
- Abstract summary: We propose a model that directly optimize both the translation as well as the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
- Score: 71.02335065794384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic dubbing (AD) is the task of translating the original speech in a
video into target language speech. The new target language speech should
satisfy isochrony; that is, the new speech should be time aligned with the
original video, including mouth movements, pauses, hand gestures, etc. In this
paper, we propose training a model that directly optimizes both the translation
and the speech duration of the generated translations. We show that this
system generates speech that better matches the timing of the original speech,
compared to prior work, while simplifying the system architecture.
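As a rough illustration of such a joint objective (not the paper's exact formulation), assume the decoder emits target tokens together with per-token speech durations; one can then combine token cross-entropy with a penalty on the duration mismatch. The helper below and the weight `lambda_dur` are illustrative:

```python
import torch.nn.functional as F

def joint_dubbing_loss(token_logits, token_targets, pred_durations,
                       source_duration, lambda_dur=0.5, pad_id=0):
    """Combine translation quality with an isochrony term.

    token_logits:    (batch, tgt_len, vocab) decoder outputs
    token_targets:   (batch, tgt_len) reference translation ids
    pred_durations:  (batch, tgt_len) predicted per-token speech seconds
    source_duration: (batch,) length of the original speech segment
    """
    # Standard cross-entropy on the translated tokens.
    ce = F.cross_entropy(token_logits.transpose(1, 2), token_targets,
                         ignore_index=pad_id)
    # Penalize mismatch between total predicted speech time and the
    # source segment's duration (the isochrony constraint).
    mask = (token_targets != pad_id).float()
    total_pred = (pred_durations * mask).sum(dim=1)
    duration_penalty = F.l1_loss(total_pred, source_duration)
    return ce + lambda_dur * duration_penalty
```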
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
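A rough skeleton of the two-encoder idea, assuming mel-filterbank input features; layer types and sizes are placeholders, not TransVIP's actual architecture:

```python
import torch.nn as nn

class DualConditionEncoder(nn.Module):
    """Rough skeleton of the two-encoder idea: one branch summarizes
    speaker/voice characteristics, the other keeps frame-level timing
    cues, and both condition the downstream decoder. Layer types and
    sizes are placeholders, not TransVIP's actual architecture."""
    def __init__(self, feat_dim=80, d_model=512):
        super().__init__()
        self.voice_encoder = nn.GRU(feat_dim, d_model, batch_first=True)
        self.timing_encoder = nn.GRU(feat_dim, d_model, batch_first=True)

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, feat_dim), e.g. mel filterbanks
        _, voice_state = self.voice_encoder(speech_feats)   # global voice vector
        timing_seq, _ = self.timing_encoder(speech_feats)   # per-frame timing
        return voice_state.squeeze(0), timing_seq
```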
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
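For intuition, discrete units of this kind are often obtained by clustering frame-level self-supervised speech features; a toy nearest-codebook quantizer, with `codebook` standing in for a learned k-means codebook (the paper's actual unit extraction may differ):

```python
import numpy as np

def speech_to_units(ssl_features, codebook):
    """Toy nearest-codebook quantizer: map frame-level self-supervised
    features (T, D) to discrete unit ids using a k-means style codebook
    (K, D). `codebook` stands in for a learned clustering."""
    dists = ((ssl_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # (T,) unit ids
```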
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters [71.02335065794384]
We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences.
We show that our model improves translation quality and isochrony compared to previous work.
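One way target factors can be realized is with parallel output heads on the shared decoder state, so each step predicts a phoneme and a duration jointly; vocabulary and bin sizes below are placeholders:

```python
import torch.nn as nn

class FactoredOutputHead(nn.Module):
    """Sketch of a target-factored output layer: at every decoder step,
    one head predicts the next phoneme and a second head predicts its
    duration bin from the same hidden state, so timing is modeled
    jointly with the translation. Sizes are placeholders."""
    def __init__(self, d_model=512, n_phonemes=100, n_duration_bins=32):
        super().__init__()
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        self.duration_head = nn.Linear(d_model, n_duration_bins)

    def forward(self, decoder_states):
        # decoder_states: (batch, tgt_len, d_model)
        return self.phoneme_head(decoder_states), self.duration_head(decoder_states)
```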
arXiv Detail & Related papers (2023-05-22T16:36:04Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, its duration should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
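A hypothetical rescoring rule in this spirit: candidates whose estimated speech time strays from the source duration budget are penalized. The duration lookup and weight `alpha` are assumptions for illustration:

```python
def duration_aware_score(log_prob, tokens, token_duration, budget, alpha=1.0):
    """Hypothetical rescoring of a candidate translation: start from the
    model log-probability and subtract a penalty proportional to how far
    the candidate's estimated speech time is from the source duration
    budget. `token_duration` (token -> seconds) and `alpha` are
    illustrative assumptions."""
    estimated = sum(token_duration.get(tok, 0.2) for tok in tokens)
    return log_prob - alpha * abs(estimated - budget)
```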
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Direct simultaneous speech to speech translation [29.958601064888132]
We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model.
The model can start generating translation in the target speech before consuming the full source speech content.
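Simultaneous decoding is commonly driven by a read/write policy; a toy wait-k schedule (a generic policy, not necessarily the one used in this paper) looks like:

```python
def wait_k_schedule(num_source_chunks, k=3):
    """Toy wait-k read/write schedule: consume k source chunks up front,
    then alternate one output step per additional input chunk, and
    finish the output once the source ends. A generic simultaneous
    policy, not necessarily the one used in this paper."""
    actions = ["READ"] * min(k, num_source_chunks)
    reads = len(actions)
    while reads < num_source_chunks:
        actions += ["WRITE", "READ"]
        reads += 1
    actions.append("WRITE")  # emit remaining target speech
    return actions

# e.g. wait_k_schedule(5) ->
# ['READ', 'READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE']
```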
arXiv Detail & Related papers (2021-10-15T17:59:15Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
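A toy adaptation module in the modality-gap spirit: project speech-encoder states toward the text-embedding space so a text-trained decoder can consume them. Dimensions and layers are illustrative, not the paper's exact design:

```python
import torch.nn as nn

class SpeechToTextAdapter(nn.Module):
    """Toy adaptation module: project speech-encoder states toward the
    text-embedding space so a text-trained decoder can consume them.
    Dimensions and layers are illustrative."""
    def __init__(self, speech_dim=512, text_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, speech_states):
        # speech_states: (batch, frames, speech_dim)
        return self.proj(speech_states)
```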
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training [40.71155396456831]
Simultaneous speech-to-speech translation is widely useful but extremely challenging.
It needs to generate target-language speech concurrently with the source-language speech, with only a few seconds delay.
Current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower.
We propose Self-Adaptive Translation (SAT) which flexibly adjusts the length of translations to accommodate different source speech rates.
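A back-of-the-envelope version of the length-adaptation idea: the dub has to fit in the time the source speaker used, so a fast source utterance leaves room for fewer target words. The assumed TTS speaking rate is an illustrative constant, not a value from the paper:

```python
def adaptive_length_budget(source_seconds, tts_words_per_second=2.5):
    """Toy length control in the self-adaptive spirit: the dub must fit
    in the time the source speaker used, so a faster source forces a
    shorter translation. The TTS speaking rate is an assumed constant."""
    return max(1, int(source_seconds * tts_words_per_second))

# e.g. a 4-second source utterance leaves a budget of 10 target words
```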
arXiv Detail & Related papers (2020-10-20T06:02:15Z)
- From Speech-to-Speech Translation to Automatic Dubbing [28.95595497865406]
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing.
Our architecture features neural machine translation that generates output of a preferred length, prosodic alignment of the translation with the original speech segments, and neural text-to-speech with fine-tuning of the duration of each utterance.
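A simple stand-in for the duration fine-tuning step: uniformly rescale per-utterance TTS durations so the dub spans the same time as the original segment (real prosodic alignment is more involved, also deciding where pauses fall):

```python
def fit_tts_durations(utterance_durations, target_total):
    """Stand-in for per-utterance duration fine-tuning: uniformly rescale
    the synthesized durations so the dubbed speech spans the same time
    as the original segment. Real prosodic alignment also decides where
    to place pauses; this only fixes the total length."""
    scale = target_total / sum(utterance_durations)
    return [d * scale for d in utterance_durations]
```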
arXiv Detail & Related papers (2020-01-19T07:03:05Z)