Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages
- URL: http://arxiv.org/abs/2411.07387v1
- Date: Mon, 11 Nov 2024 21:39:21 GMT
- Title: Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages
- Authors: Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue
- Abstract summary: End-to-end speech translation (ST) translates source language speech directly into target language text.
Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio.
We present improvements to the duration alignment component of our sequence-to-sequence ST model.
- Score: 33.5772006275197
- Abstract: End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or characters generated by the Machine Translation model to approximate the source sentence's length, without considering the isochrony of pauses and speech segments, as duration can vary between languages. To address this, we present improvements to the duration alignment component of our sequence-to-sequence ST model. Our method controls translation length by predicting the duration of speech and pauses in conjunction with the translation process. This is achieved by providing timing information to the decoder, ensuring it tracks the remaining duration for speech and pauses while generating the translation. Evaluation on the Zh-En test set of CoVoST 2 demonstrates that the proposed Isochrony-Controlled ST achieves 0.92 speech overlap and 8.9 BLEU, only a 1.4 BLEU drop compared to the ST baseline.
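The loop below is a minimal, hypothetical sketch of the duration-tracking idea the abstract describes: the decoder is conditioned on the remaining speech and pause budgets at every step, and each emitted token's predicted duration decrements the corresponding budget. The `decoder.step` interface and all names are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of isochrony-controlled decoding (illustrative only).
# Assumption: a decoder whose step() accepts the remaining speech/pause
# durations as extra conditioning and returns (token, duration, is_pause).

def isochrony_controlled_decode(decoder, src_features, speech_budget,
                                pause_budget, eos_id, max_len=256):
    """Greedy decoding that tracks remaining speech/pause duration."""
    tokens = []
    remaining_speech = speech_budget  # seconds of speech in the source audio
    remaining_pause = pause_budget    # seconds of pause in the source audio
    for _ in range(max_len):
        # Timing information is fed to the decoder at every step so the
        # model can plan a translation that fits the remaining budget.
        token, dur, is_pause = decoder.step(
            src_features, tokens, remaining_speech, remaining_pause)
        if token == eos_id:
            break
        tokens.append(token)
        if is_pause:
            remaining_pause = max(0.0, remaining_pause - dur)
        else:
            remaining_speech = max(0.0, remaining_speech - dur)
    return tokens
```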
Related papers
- Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters [71.02335065794384]
We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences.
We show that our model improves translation quality and isochrony compared to previous work.
arXiv Detail & Related papers (2023-05-22T16:36:04Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, its length/duration should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation (a toy sketch of speech-aware length control appears after this list).
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation [37.48971774827332]
We propose to generate ST tokens out of order while remembering how to re-order them later.
We examine two variants of such operation sequences which enable generation of monotonic transcriptions and non-monotonic translations.
arXiv Detail & Related papers (2022-11-11T02:29:28Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating source language speech into target language text without generating intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training [40.71155396456831]
Simultaneous speech-to-speech translation is widely useful but extremely challenging.
It must generate target-language speech concurrently with the source-language speech, with a delay of only a few seconds.
Current approaches accumulate latency progressively when the speaker talks faster and introduce unnatural pauses when the speaker talks slower.
We propose Self-Adaptive Translation (SAT), which flexibly adjusts the length of translations to accommodate different source speech rates.
arXiv Detail & Related papers (2020-10-20T06:02:15Z)
- Is 42 the Answer to Everything in Subtitling-oriented Speech Translation? [16.070428245677675]
Subtitling is becoming increasingly important for disseminating information.
We explore two methods for applying Speech Translation (ST) to subtitling.
arXiv Detail & Related papers (2020-06-01T17:02:28Z)
- Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation [127.54315184545796]
Speech translation (ST) aims to learn transformations from speech in the source language to text in the target language.
We propose to improve the multitask ST model by using word embeddings as the intermediate representation.
arXiv Detail & Related papers (2020-05-21T14:22:35Z)
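As a companion to the VideoDubber entry above, here is a toy, self-contained illustration of speech-aware length control: candidate translations are re-ranked by how closely a crude duration estimate matches the source speech duration. The per-phoneme constant and the character-to-phoneme ratio are rough assumptions for illustration, not values from any paper listed here.

```python
# Toy illustration of speech-aware length control (not from any paper above):
# re-rank candidate translations by how closely their estimated spoken
# duration matches the source speech duration.

AVG_PHONE_SEC = 0.08  # rough per-phoneme duration in seconds; an assumption

def estimated_duration(text, phones_per_char=1.2):
    # Crude duration proxy: non-space characters -> phonemes -> seconds.
    return len(text.replace(" ", "")) * phones_per_char * AVG_PHONE_SEC

def rerank_by_isochrony(candidates, src_duration, alpha=1.0):
    """candidates: list of (text, model_score) pairs. Returns them sorted
    by model score minus a penalty for duration mismatch."""
    def score(cand):
        text, model_score = cand
        return model_score - alpha * abs(estimated_duration(text) - src_duration)
    return sorted(candidates, key=score, reverse=True)

# The longer candidate wins despite a worse model score, because its
# estimated duration better matches the 1.0 s source audio.
print(rerank_by_isochrony([("hello there", -1.0), ("hi", -0.8)], src_duration=1.0))
```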