Long-form Simultaneous Speech Translation: Thesis Proposal
- URL: http://arxiv.org/abs/2310.11141v1
- Date: Tue, 17 Oct 2023 10:44:05 GMT
- Title: Long-form Simultaneous Speech Translation: Thesis Proposal
- Authors: Peter Polák
- Abstract summary: Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence.
Deep learning has sparked significant interest in end-to-end (E2E) systems.
This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting.
- Score: 3.252719444437546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Simultaneous speech translation (SST) aims to provide real-time translation
of spoken language, even before the speaker finishes their sentence.
Traditionally, SST has been addressed primarily by cascaded systems that
decompose the task into subtasks, including speech recognition, segmentation,
and machine translation. However, the advent of deep learning has sparked
significant interest in end-to-end (E2E) systems. Nevertheless, a major
limitation of most approaches to E2E SST reported in the current literature is
that they assume that the source speech is pre-segmented into sentences, which
is a significant obstacle for practical, real-world applications. This thesis
proposal addresses end-to-end simultaneous speech translation, particularly in
the long-form setting, i.e., without pre-segmentation. We present a survey of
the latest advancements in E2E SST, assess the primary obstacles in SST and its
relevance to long-form scenarios, and suggest approaches to tackle these
challenges.
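The proposal surveys approaches rather than fixing one, but a common fixed-latency baseline for E2E simultaneous translation in the literature is the wait-k policy: first read k source chunks, then alternate between writing one target token and reading one more chunk. The sketch below is a minimal, hypothetical illustration of that read/write loop over an unsegmented audio stream; `stream.read_chunk`, `encode_chunk`, and `decode_token` are assumed stand-ins for a streaming source, a streaming encoder, and an incremental decoder, not components of the thesis's system.

```python
# Minimal sketch of a wait-k read/write policy for simultaneous translation.
# All callables are hypothetical stand-ins; the thesis proposal does not
# commit to this particular policy.

def wait_k_translate(stream, encode_chunk, decode_token, k=3, eos="</s>"):
    """Read k source chunks up front, then alternate WRITE and READ."""
    encoder_states = []   # grows as more source speech is read
    target = []           # target tokens emitted so far
    source_done = False

    def read():
        nonlocal source_done
        chunk = stream.read_chunk()           # e.g. 250 ms of audio, or None
        if chunk is None:
            source_done = True                # source speech has ended
        else:
            encoder_states.append(encode_chunk(chunk, encoder_states))

    while True:
        # READ until we are k chunks ahead of the tokens written so far.
        if not source_done and len(encoder_states) < len(target) + k:
            read()
            continue
        # WRITE one target token given the source consumed so far.
        token = decode_token(encoder_states, target)
        if token == eos:
            if source_done:
                break                         # source exhausted, output complete
            read()                            # premature end: read more instead
            continue
        target.append(token)

    return target
```

Note that nothing in this loop requires sentence boundaries: it can in principle run over long-form audio, which is precisely where the pre-segmentation assumption criticized in the abstract becomes a practical obstacle.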
Related papers
- Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? [7.682929772871941]
Prosody is rarely studied within the context of speech-to-text translation systems.
End-to-end (E2E) systems have direct access to the speech signal when making translation decisions.
A main challenge is the difficulty of evaluating prosody awareness in translation.
arXiv Detail & Related papers (2024-10-31T15:20:50Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling [13.757256085713571]
We present TAP-FM, a novel two-stage prediction pipeline.
Specifically, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which aims to acquire richer insights via multi-granularity contrastive pre-training in an unsupervised manner.
Our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations.
arXiv Detail & Related papers (2024-04-14T08:56:19Z)
- Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization [73.85027121522295]
We introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments.
Our proposed contextual E2E-ST outperforms the isolated utterance-based E2E-ST approach.
arXiv Detail & Related papers (2023-09-27T14:32:30Z)
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity, because parallel corpora pairing source-language speech with target-language speech are very rare.
In this paper, we propose Speech2S, a model jointly pre-trained on unpaired speech and bilingual text data for direct speech-to-speech translation.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices (see the sketch after this list).
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
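For the TranSpeech entry above, "repeatedly masks and predicts unit choices" describes mask-predict style iterative non-autoregressive decoding: all target speech units are predicted in parallel, then the least-confident positions are re-masked and re-predicted for a fixed number of rounds. The following is a minimal sketch under that reading; `predict_units` is a hypothetical stand-in for the non-autoregressive decoder, and the target length is assumed to come from a separate length predictor, as is typical for this decoding style.

```python
# Minimal sketch of mask-predict iterative decoding over discrete speech
# units. `predict_units(encoder_states, units)` is a hypothetical model
# call returning, for every position, the most likely unit and its
# probability; it is not TranSpeech's actual API.

MASK = -1  # sentinel marking a masked position

def mask_predict(encoder_states, predict_units, length, iterations=10):
    units = [MASK] * length                    # start fully masked
    probs = [0.0] * length

    for t in range(iterations):
        new_units, new_probs = predict_units(encoder_states, units)
        # Commit predictions only at positions that were masked this round;
        # previously committed positions keep their units and confidences.
        for i in range(length):
            if units[i] == MASK:
                units[i], probs[i] = new_units[i], new_probs[i]

        if t == iterations - 1:
            break
        # Linearly decaying schedule: re-mask the n least-confident positions.
        n = int(length * (iterations - 1 - t) / iterations)
        for i in sorted(range(length), key=lambda i: probs[i])[:n]:
            units[i] = MASK

    return units
```

Because each round fills all masked positions in parallel, the number of decoder calls is the fixed iteration count rather than the output length, which is where non-autoregressive inference gets the latency speedup reported above.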
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences of its use.