Improving Isochronous Machine Translation with Target Factors and
Auxiliary Counters
- URL: http://arxiv.org/abs/2305.13204v1
- Date: Mon, 22 May 2023 16:36:04 GMT
- Title: Improving Isochronous Machine Translation with Target Factors and
Auxiliary Counters
- Authors: Proyag Pal, Brian Thompson, Yogesh Virkar, Prashant Mathur, Alexandra
Chronopoulou, Marcello Federico
- Abstract summary: We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences.
We show that our model improves translation quality and isochrony compared to previous work.
- Score: 71.02335065794384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To translate speech for automatic dubbing, machine translation needs to be
isochronous, i.e., the translated speech must be aligned with the source in
terms of speech duration. We introduce target factors in a transformer model
to predict durations jointly with target-language phoneme sequences. We also
introduce auxiliary counters to help the decoder keep track of timing
information while generating target phonemes. We show that our model improves
translation quality and isochrony compared to previous work, where the
translation model is instead trained to predict interleaved sequences of
phonemes and durations.
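A minimal sketch of how target factors and auxiliary counters could be wired into a transformer decoder may clarify the idea. The module names, the two-counter definition (elapsed/remaining time), and all dimensions below are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FactoredIsochronousDecoder(nn.Module):
    """Sketch of a decoder that predicts a phoneme (primary output) and its
    duration (target factor) at every step, rather than interleaving phonemes
    and durations in a single output sequence as in prior work."""

    def __init__(self, n_phonemes, n_duration_bins, d_model=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.duration_emb = nn.Embedding(n_duration_bins, d_model)
        # Auxiliary counters (assumed form): e.g. elapsed and remaining time,
        # exposed to the decoder so it can track timing during generation.
        self.counter_proj = nn.Linear(2, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        self.duration_head = nn.Linear(d_model, n_duration_bins)

    def forward(self, prev_phonemes, prev_durations, counters, encoder_out):
        # Each input position sums both factor embeddings plus the counter
        # features, so phoneme identity and timing state travel together.
        x = (self.phoneme_emb(prev_phonemes)
             + self.duration_emb(prev_durations)
             + self.counter_proj(counters))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, encoder_out, tgt_mask=mask)
        # Joint prediction of both factors from the same decoder state.
        return self.phoneme_head(h), self.duration_head(h)
```

Under these assumptions, training would add a cross-entropy loss per output head, and at inference the counter features would be updated from the cumulative predicted durations so the decoder can land near the source utterance's total duration.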
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
- Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features [13.44542301438426]
We propose a direct speech-to-speech translation model which can be trained without any textual annotation or content information.
Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach.
arXiv Detail & Related papers (2022-12-12T10:03:10Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, the duration of the translated speech should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Prosody-Aware Neural Machine Translation for Dubbing [9.49303003480503]
We introduce the task of prosody-aware machine translation which aims at generating translations suitable for dubbing.
Dubbing of a spoken sentence requires transferring the content as well as the prosodic structure of the source into the target language to preserve timing information.
We propose implicit and explicit modeling approaches to integrate prosody information into neural machine translation.
arXiv Detail & Related papers (2021-12-16T01:11:08Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual-mode output (speech and text) in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Streaming Simultaneous Speech Translation with Augmented Memory Transformer [29.248366441276662]
Transformer-based models have achieved state-of-the-art performance on speech translation tasks.
We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder.
arXiv Detail & Related papers (2020-10-30T18:28:42Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model that aims to improve end-to-end model performance by bridging the modality gap between speech and text (a minimal adapter sketch follows this list).
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
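For the modality-gap entry above, one plausible form of such an adaptation layer is a module that maps long acoustic encoder states toward the text embedding space. The length-shrinking strategy and all dimensions below are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class SpeechToTextAdapter(nn.Module):
    """Sketch: project (typically much longer) acoustic encoder states
    toward the text embedding space so a text-trained decoder can consume
    them, one simple way to narrow the speech/text modality gap."""

    def __init__(self, d_speech=768, d_text=512, kernel=3, stride=2):
        super().__init__()
        # A strided convolution shortens the acoustic sequence toward
        # text-like lengths; the projection matches hidden dimensions.
        self.shrink = nn.Conv1d(d_speech, d_speech, kernel, stride=stride,
                                padding=kernel // 2)
        self.proj = nn.Linear(d_speech, d_text)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, speech_states):              # (batch, time, d_speech)
        x = self.shrink(speech_states.transpose(1, 2)).transpose(1, 2)
        return self.norm(self.proj(x))             # (batch, time', d_text)
```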