Impact of Encoding and Segmentation Strategies on End-to-End
Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2104.14470v1
- Date: Thu, 29 Apr 2021 16:33:08 GMT
- Title: Impact of Encoding and Segmentation Strategies on End-to-End
Simultaneous Speech Translation
- Authors: Ha Nguyen, Yannick Est\`eve, Laurent Besacier
- Abstract summary: This paper investigates two key aspects of end-to-end simultaneous speech translation: how to encode efficiently the continuous speech flow and how to segment the speech flow.
We extend our previously proposed end-to-end online decoding strategy and show that while replacing BLSTM by ULSTM encoding degrades performance in offline mode, it actually improves both efficiency and performance in online mode.
- Score: 17.78024523121448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Boosted by the simultaneous translation shared task at IWSLT 2020, promising
end-to-end online speech translation approaches were recently proposed. They
consist in incrementally encoding a speech input (in a source language) and
decoding the corresponding text (in a target language) with the best possible
trade-off between latency and translation quality. This paper investigates two
key aspects of end-to-end simultaneous speech translation: (a) how to encode
efficiently the continuous speech flow, and (b) how to segment the speech flow
in order to alternate optimally between reading (R: encoding input) and writing
(W: decoding output) operations. We extend our previously proposed end-to-end
online decoding strategy and show that while replacing BLSTM by ULSTM encoding
degrades performance in offline mode, it actually improves both efficiency and
performance in online mode. We also measure the impact of different methods to
segment the speech signal (using fixed interval boundaries, oracle word
boundaries or randomly set boundaries) and show that our best end-to-end online
decoding strategy is surprisingly the one that alternates R/W operations on
fixed size blocks on our English-German speech translation setup.
Related papers
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X)
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z) - Long-Form End-to-End Speech Translation via Latent Alignment
Segmentation [6.153530338207679]
Current simultaneous speech translation models can process audio only up to a few seconds long.
We propose a novel segmentation approach for a low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
arXiv Detail & Related papers (2023-09-20T15:10:12Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z) - M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z) - RealTranS: End-to-End Simultaneous Speech Translation with Convolutional
Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
arXiv Detail & Related papers (2021-06-09T06:35:46Z) - An Empirical Study of End-to-end Simultaneous Speech Translation
Decoding Strategies [17.78024523121448]
This paper proposes a decoding strategy for end-to-end simultaneous speech translation.
We leverage end-to-end models trained in offline mode and conduct an empirical study for two language pairs.
arXiv Detail & Related papers (2021-03-04T18:55:40Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - Orthros: Non-autoregressive End-to-end Speech Translation with
Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used for selecting better translation among various length candidates generated from the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.