Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
- URL: http://arxiv.org/abs/2309.11384v1
- Date: Wed, 20 Sep 2023 15:10:12 GMT
- Title: Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
- Authors: Peter Polák, Ondřej Bojar
- Abstract summary: Current simultaneous speech translation models can process audio only up to a few seconds long.
We propose a novel segmentation approach for low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
- Score: 6.153530338207679
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current simultaneous speech translation models can process audio only up to a
few seconds long. Contemporary datasets provide an oracle segmentation into
sentences based on human-annotated transcripts and translations. However, the
segmentation into sentences is not available in the real world. Current speech
segmentation approaches either offer poor segmentation quality or have to trade
latency for quality. In this paper, we propose a novel segmentation approach
for low-latency end-to-end speech translation. We leverage the existing
speech translation encoder-decoder architecture with ST CTC and show that it
can perform the segmentation task without supervision or additional parameters.
To the best of our knowledge, our method is the first that allows an actual
end-to-end simultaneous speech translation, as the same model is used for
translation and segmentation at the same time. On a diverse set of language
pairs and in- and out-of-domain data, we show that the proposed approach
achieves state-of-the-art quality at no additional computational cost.
Related papers
- Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings [2.615008111842321]
We introduce an end-to-end scheme for topic segmentation using semantic speech encoders.
We propose a new benchmark for spoken news topic segmentation by utilizing a dataset featuring 1000 hours of publicly available recordings.
Our results demonstrate that while the traditional pipeline approach achieves a state-of-the-art $P_k$ score of 0.2431 for English, our end-to-end model delivers a competitive $P_k$ score of 0.2564.
arXiv Detail & Related papers (2024-09-10T05:24:36Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.
This includes the segmentation of the audio as well as the run-time of the different components.
We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z)
- End-to-End Simultaneous Speech Translation with Differentiable Segmentation [21.03142288187605]
SimulST outputs translation while receiving the streaming speech inputs.
Segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the translation model.
We propose Differentiable segmentation (DiSeg) for SimulST to directly learn segmentation from the underlying translation model.
arXiv Detail & Related papers (2023-05-25T14:25:12Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation [10.799623536095226]
For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem.
We compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings.
Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
arXiv Detail & Related papers (2022-10-24T16:06:33Z)
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audio recordings, such as TED talks, which have to be split into shorter segments.
We propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.