Prosody in Cascade and Direct Speech-to-Text Translation: a case study
on Korean Wh-Phrases
- URL: http://arxiv.org/abs/2402.00632v1
- Date: Thu, 1 Feb 2024 14:46:35 GMT
- Title: Prosody in Cascade and Direct Speech-to-Text Translation: a case study
on Korean Wh-Phrases
- Authors: Giulio Zhou, Tsz Kin Lam, Alexandra Birch, Barry Haddow
- Abstract summary: This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
- Score: 79.07111754406841
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech-to-Text Translation (S2TT) has typically been addressed with cascade
systems, where speech recognition systems generate a transcription that is
subsequently passed to a translation model. While there has been a growing
interest in developing direct speech translation systems to avoid propagating
errors and losing non-verbal content, prior work in direct S2TT has struggled
to conclusively establish the advantages of integrating the acoustic signal
directly into the translation process. This work proposes using contrastive
evaluation to quantitatively measure the ability of direct S2TT systems to
disambiguate utterances where prosody plays a crucial role. Specifically, we
evaluated Korean-English translation systems on a test set containing
wh-phrases, for which prosodic features are necessary to produce translations
with the correct intent, whether that is a statement, a yes/no question, a
wh-question, or another sentence type. Our results clearly demonstrate the
value of direct
translation systems over cascade translation models, with a notable 12.9%
improvement in overall accuracy in ambiguous cases, along with up to a 15.6%
increase in F1 scores for one of the major intent categories. To the best of
our knowledge, this work stands as the first to provide quantitative evidence
that direct S2TT models can effectively leverage prosody. The code for our
evaluation is openly available for review and use.
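
The abstract describes the evaluation protocol only at a high level. As a rough, non-authoritative sketch of how a contrastive evaluation of this kind could be set up (the `score_translation(audio, text)` hook and the example fields below are illustrative assumptions, not the authors' released interface), the snippet picks, for each ambiguous Korean utterance, the candidate English translation the model scores highest and then reports overall accuracy and per-intent F1.

```python
# Illustrative contrastive-evaluation sketch (not the paper's released code).
# Assumed inputs:
#   examples: list of dicts with keys
#       "audio"      - source speech (waveform, features, or a file path)
#       "candidates" - {intent_label: candidate English translation}
#       "gold"       - intent label that matches the utterance's prosody
#   score_translation(audio, text) -> float, e.g. the S2TT model's
#       length-normalised log-probability of `text` given `audio`.

def contrastive_eval(examples, score_translation):
    n_correct = 0
    gold_labels, pred_labels = [], []
    for ex in examples:
        # The system disambiguates the utterance if it assigns the highest
        # score to the candidate whose intent matches the prosody.
        pred = max(
            ex["candidates"],
            key=lambda lab: score_translation(ex["audio"], ex["candidates"][lab]),
        )
        gold_labels.append(ex["gold"])
        pred_labels.append(pred)
        n_correct += int(pred == ex["gold"])

    accuracy = n_correct / len(gold_labels)

    # Per-intent F1 computed from the induced intent predictions.
    f1 = {}
    for label in set(gold_labels):
        tp = sum(p == label == g for p, g in zip(pred_labels, gold_labels))
        fp = sum(p == label != g for p, g in zip(pred_labels, gold_labels))
        fn = sum(g == label != p for p, g in zip(pred_labels, gold_labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1[label] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return accuracy, f1
```

In this setup each test item would pair one recording with one candidate translation per possible intent (statement, yes/no question, wh-question, ...), so a cascade system that has discarded the prosodic cues can only guess, while a direct system can, in principle, exploit them.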
Related papers
- Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? [7.682929772871941]
Prosody is rarely studied within the context of speech-to-text translation systems.
End-to-end (E2E) systems have direct access to the speech signal when making translation decisions.
A main challenge is the difficulty of evaluating prosody awareness in translation.
arXiv Detail & Related papers (2024-10-31T15:20:50Z)
- KIT's Multilingual Speech Translation System for IWSLT 2023 [58.5152569458259]
We describe our speech translation system for the multilingual track of IWSLT 2023.
The task requires translation into 10 languages of varying amounts of resources.
Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation.
arXiv Detail & Related papers (2023-06-08T16:13:20Z)
- Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation [70.33052952571884]
We propose to build a cascaded speech translation system without leveraging any kind of paired data.
We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS.
arXiv Detail & Related papers (2023-05-12T13:07:51Z)
- A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation [45.47457657122893]
Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of source speech to target speech while maintaining translation accuracy.
Existing research in expressive S2ST is limited, typically focusing on a single expressivity aspect at a time.
We propose a holistic cascade system for expressive S2ST, combining multiple prosody transfer techniques previously considered only in isolation.
arXiv Detail & Related papers (2023-01-25T14:27:00Z)
- Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on-par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z)
- Proficiency assessment of L2 spoken English using wav2vec 2.0 [3.4012007729454816]
We use wav2vec 2.0 for assessing overall and individual aspects of proficiency on two small datasets.
We find that this approach significantly outperforms the BERT-based baseline system trained on ASR and manual transcriptions used for comparison.
arXiv Detail & Related papers (2022-10-24T12:36:49Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedups of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
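
The last entry evaluates translations without references by comparing cross-lingual sentence embeddings of the source and the hypothesis. As a minimal illustration of that idea (not the paper's code; `embed_sentences` is a hypothetical stand-in for an encoder such as LASER or mean-pooled multilingual BERT), the sketch below scores each source-translation pair by cosine similarity.

```python
# Reference-free MT evaluation sketch: cosine similarity of cross-lingual
# sentence embeddings (illustrative only).
import numpy as np

def reference_free_scores(sources, translations, embed_sentences):
    """Return one similarity score per (source, translation) pair.

    sources, translations: parallel lists of strings in different languages.
    embed_sentences(list_of_str) -> np.ndarray of shape (n, d), one
        language-agnostic sentence embedding per input (assumed API).
    """
    src = embed_sentences(sources)
    hyp = embed_sentences(translations)
    # Normalise rows and take the row-wise dot product (= cosine similarity);
    # higher similarity is used as a proxy for translation adequacy.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    hyp = hyp / np.linalg.norm(hyp, axis=1, keepdims=True)
    return (src * hyp).sum(axis=1)
```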
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.