Long-form Simultaneous Speech Translation: Thesis Proposal
- URL: http://arxiv.org/abs/2310.11141v1
- Date: Tue, 17 Oct 2023 10:44:05 GMT
- Title: Long-form Simultaneous Speech Translation: Thesis Proposal
- Authors: Peter Polák
- Abstract summary: Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence.
Deep learning has sparked significant interest in end-to-end (E2E) systems.
This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting.
- Score: 3.252719444437546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Simultaneous speech translation (SST) aims to provide real-time translation
of spoken language, even before the speaker finishes their sentence.
Traditionally, SST has been addressed primarily by cascaded systems that
decompose the task into subtasks, including speech recognition, segmentation,
and machine translation. However, the advent of deep learning has sparked
significant interest in end-to-end (E2E) systems. Nevertheless, a major
limitation of most approaches to E2E SST reported in the current literature is
that they assume that the source speech is pre-segmented into sentences, which
is a significant obstacle for practical, real-world applications. This thesis
proposal addresses end-to-end simultaneous speech translation, particularly in
the long-form setting, i.e., without pre-segmentation. We present a survey of
the latest advancements in E2E SST, assess the primary obstacles in SST and its
relevance to long-form scenarios, and suggest approaches to tackle these
challenges.
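The proposal surveys approaches rather than fixing one, but a common fixed-latency baseline for E2E simultaneous translation in the literature is the wait-k policy: first read k source chunks, then alternate between writing one target token and reading one more chunk. The sketch below is a minimal, hypothetical illustration of that read/write loop over an unsegmented audio stream; `stream.read_chunk`, `encode_chunk`, and `decode_token` are assumed stand-ins for a streaming source, a streaming encoder, and an incremental decoder, not components of the thesis's system.

```python
# Minimal sketch of a wait-k read/write policy for simultaneous translation.
# All callables are hypothetical stand-ins; the thesis proposal does not
# commit to this particular policy.

def wait_k_translate(stream, encode_chunk, decode_token, k=3, eos="</s>"):
    """Read k source chunks up front, then alternate WRITE and READ."""
    encoder_states = []   # grows as more source speech is read
    target = []           # target tokens emitted so far
    source_done = False

    def read():
        nonlocal source_done
        chunk = stream.read_chunk()           # e.g. 250 ms of audio, or None
        if chunk is None:
            source_done = True                # source speech has ended
        else:
            encoder_states.append(encode_chunk(chunk, encoder_states))

    while True:
        # READ until we are k chunks ahead of the tokens written so far.
        if not source_done and len(encoder_states) < len(target) + k:
            read()
            continue
        # WRITE one target token given the source consumed so far.
        token = decode_token(encoder_states, target)
        if token == eos:
            if source_done:
                break                         # source exhausted, output complete
            read()                            # premature end: read more instead
            continue
        target.append(token)

    return target
```

Note that nothing in this loop requires sentence boundaries: it can in principle run over long-form audio, which is precisely where the pre-segmentation assumption criticized in the abstract becomes a practical obstacle.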
Related papers
- Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? [7.682929772871941]
Prosody is rarely studied within the context of speech-to-text translation systems.
End-to-end (E2E) systems have direct access to the speech signal when making translation decisions.
A main challenge is the difficulty of evaluating prosody awareness in translation.
arXiv Detail & Related papers (2024-10-31T15:20:50Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling [13.757256085713571]
We present TAP-FM, a novel two-stage prediction pipeline.
Specifically, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which aims to acquire richer insights via multi-granularity contrastive pre-training in an unsupervised manner.
Our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations.
arXiv Detail & Related papers (2024-04-14T08:56:19Z)
- Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization [73.85027121522295]
We introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments.
Our proposed contextual E2E-ST outperforms the isolated utterance-based E2E-ST approach.
arXiv Detail & Related papers (2023-09-27T14:32:30Z)
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity, because parallel corpora pairing source-language speech with target-language speech are very rare.
In this paper, we propose Speech2S, a model jointly pre-trained on unpaired speech and bilingual text data for direct speech-to-speech translation.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices (see the sketch after this list).
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
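For the TranSpeech entry above, "repeatedly masks and predicts unit choices" describes mask-predict style iterative non-autoregressive decoding: all target speech units are predicted in parallel, then the least-confident positions are re-masked and re-predicted for a fixed number of rounds. The following is a minimal sketch under that reading; `predict_units` is a hypothetical stand-in for the non-autoregressive decoder, and the target length is assumed to come from a separate length predictor, as is typical for this decoding style.

```python
# Minimal sketch of mask-predict iterative decoding over discrete speech
# units. `predict_units(encoder_states, units)` is a hypothetical model
# call returning, for every position, the most likely unit and its
# probability; it is not TranSpeech's actual API.

MASK = -1  # sentinel marking a masked position

def mask_predict(encoder_states, predict_units, length, iterations=10):
    units = [MASK] * length                    # start fully masked
    probs = [0.0] * length

    for t in range(iterations):
        new_units, new_probs = predict_units(encoder_states, units)
        # Commit predictions only at positions that were masked this round;
        # previously committed positions keep their units and confidences.
        for i in range(length):
            if units[i] == MASK:
                units[i], probs[i] = new_units[i], new_probs[i]

        if t == iterations - 1:
            break
        # Linearly decaying schedule: re-mask the n least-confident positions.
        n = int(length * (iterations - 1 - t) / iterations)
        for i in sorted(range(length), key=lambda i: probs[i])[:n]:
            units[i] = MASK

    return units
```

Because each round fills all masked positions in parallel, the number of decoder calls is the fixed iteration count rather than the output length, which is where non-autoregressive inference gets the latency speedup reported above.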
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences of its use.