Streaming Simultaneous Speech Translation with Augmented Memory Transformer
- URL: http://arxiv.org/abs/2011.00033v1
- Date: Fri, 30 Oct 2020 18:28:42 GMT
- Title: Streaming Simultaneous Speech Translation with Augmented Memory Transformer
- Authors: Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Pino
- Abstract summary: Transformer-based models have achieved state-of-the-art performance on speech translation tasks.
We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder.
- Score: 29.248366441276662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have achieved state-of-the-art performance on
speech translation tasks. However, the architecture is not efficient enough for
streaming scenarios, since self-attention is computed over the entire input
sequence and its computational cost grows quadratically with sequence length.
Moreover, most previous work on simultaneous speech translation, the task of
generating translations from partial audio input, ignores the time spent
generating the translation when analyzing latency. Under that assumption, a
system may achieve good latency-quality trade-offs yet be inapplicable in
real-time scenarios. In this paper, we focus on the task of streaming
simultaneous speech translation, where systems are not only capable of
translating with partial input but can also handle very long or continuous
input. We propose an end-to-end transformer-based sequence-to-sequence model
equipped with an augmented memory transformer encoder, an approach that has
shown great success on streaming automatic speech recognition with hybrid and
transducer-based models. We conduct an empirical evaluation of the proposed
model across segment, context, and memory sizes, and we compare our approach
to a transformer with a unidirectional mask.
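For intuition, here is a minimal NumPy sketch of the encoder idea the abstract describes: the utterance is processed segment by segment, and each segment attends only to itself, a short left context, and a fixed-size bank of memory vectors summarizing earlier segments. The sizes (seg_len, left_ctx, mem_size) and the mean-pooled summarization query are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def attend(q, k, v):
    # single-head scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def augmented_memory_encoder(frames, seg_len=16, left_ctx=4, mem_size=8):
    """Encode a long utterance with constant per-segment attention cost.

    Each segment attends to [memory bank; left context; segment] instead of
    the whole sequence, so cost no longer grows quadratically with length.
    """
    d = frames.shape[1]
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    memory, outputs = [], []
    for start in range(0, len(frames), seg_len):
        seg = frames[start:start + seg_len]
        ctx = frames[max(0, start - left_ctx):start]
        pieces = ([np.stack(memory[-mem_size:])] if memory else []) + [ctx, seg]
        keys = np.concatenate([p for p in pieces if len(p) > 0])
        outputs.append(attend(seg @ W_q, keys @ W_k, keys @ W_v))
        # a mean-pooled summarization query writes one new memory slot
        summary_q = seg.mean(axis=0, keepdims=True) @ W_q
        memory.append(attend(summary_q, keys @ W_k, keys @ W_v)[0])
    return np.concatenate(outputs)

frames = np.random.default_rng(1).standard_normal((100, 64))
print(augmented_memory_encoder(frames).shape)  # (100, 64)
```

The real encoder is multi-head and also uses right context; the sketch only shows the bounded key set that keeps per-segment cost constant.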
Related papers
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely "hidden transfer", which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
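The hidden-transfer mechanism itself is not reproduced here; the hedged Python sketch below only shows the generic draft-and-verify acceptance rule that makes such parallel decoding lossless, with a toy deterministic model standing in for a real one.

```python
# Toy stand-in for a real language model: the greedy "next token" is a pure
# function of the prefix, so the verification can be checked by hand.
def base_next_token(prefix):
    return sum(prefix) % 7

def verify(prefix, draft):
    """Accept the longest draft prefix that greedy decoding agrees with.

    A real implementation scores all draft positions in one forward pass of
    the base model; the sequential loop here is only for clarity.
    """
    accepted = []
    for tok in draft:
        expected = base_next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the model's token, drop the rest
            return accepted
        accepted.append(tok)
    accepted.append(base_next_token(prefix + accepted))  # free extra token
    return accepted

print(verify([1, 2], [3, 6, 0]))  # -> [3, 6, 5]: two draft tokens accepted
```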
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it,es,de->en demonstrate the effectiveness of our approach, enabling for the first time the generation of one-to-many joint outputs with a single decoder.
arXiv Detail & Related papers (2023-10-23T11:00:27Z) - RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
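The Wait-K-Stride-N policy named in this summary is easy to state in code: wait until k source segments have been read, then alternate between writing n target tokens and reading n more segments. The sketch below is a minimal rendering of that read/write schedule; `translate_step` is a hypothetical stand-in for the model's decoder, and tail generation after the source ends is omitted.

```python
def wait_k_stride_n(source_stream, k=3, n=2, translate_step=None):
    """Simultaneous decoding schedule: wait k reads, then n writes per n reads."""
    translate_step = translate_step or (lambda src, tgt: f"tok{len(tgt)}")
    src, tgt = [], []
    for segment in source_stream:
        src.append(segment)                 # READ one source segment
        if len(src) >= k and (len(src) - k) % n == 0:
            for _ in range(n):              # WRITE a stride of n target tokens
                tgt.append(translate_step(src, tgt))
    return tgt

print(wait_k_stride_n(iter(range(7)), k=3, n=2))  # -> ['tok0', ..., 'tok5']
```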
arXiv Detail & Related papers (2021-06-09T06:35:46Z) - Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to produce streaming recognition results with limited latency, as well as delayed results with large improvements in accuracy.
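A hedged NumPy sketch of that Y-shaped layout follows: shared lower layers use a streamable mask, then two stacks of upper layers run in parallel, one with the same limited look-ahead mask and one with full context. Layer internals are reduced to masked single-head self-attention for brevity; the mask sizes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def attn_layer(x, mask):
    # masked self-attention standing in for a full transformer block
    scores = np.where(mask, x @ x.T / np.sqrt(x.shape[-1]), -1e9)
    return softmax(scores) @ x

T, d = 12, 16
x = np.random.default_rng(0).standard_normal((T, d))
i, j = np.indices((T, T))
stream_mask = (j <= i + 2) & (j >= i - 6)  # small look-ahead, bounded history
full_mask = np.ones((T, T), dtype=bool)    # offline mode: everything visible

shared = attn_layer(x, stream_mask)            # shared lower layers: streamable
low_latency = attn_layer(shared, stream_mask)  # streaming branch of the "Y"
high_accuracy = attn_layer(shared, full_mask)  # delayed, full-context branch
print(low_latency.shape, high_accuracy.shape)
```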
arXiv Detail & Related papers (2020-10-07T05:58:28Z) - Learning to Count Words in Fluent Speech enables Online Speech Recognition [10.74796391075403]
We introduce Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting.
Experiments performed on the LRS2, LibriSpeech, and Aishell-1 datasets show that the online system performs comparably to the offline one with a dynamic algorithmic delay of 5 segments.
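How a word counter can gate online decoding is sketched below in a deliberately simplified form: the decoder may emit word i only once the estimated cumulative word count of the audio seen so far exceeds i plus a fixed margin. `count_words` and `emit_word` are hypothetical stand-ins for model heads, and this gating is an assumption about the idea, not Taris's exact algorithm.

```python
def online_decode(segments, count_words, emit_word, delay=2):
    """Emit words lazily, trailing the estimated word count by `delay`."""
    emitted, seen = [], []
    for seg in segments:
        seen.append(seg)                      # consume one audio segment
        estimated = count_words(seen)         # cumulative soft word count
        while len(emitted) + delay <= estimated:
            emitted.append(emit_word(seen, emitted))
    return emitted

segs = [0.8, 0.7, 1.1, 0.9, 1.0, 1.2, 0.8]   # toy per-segment word increments
print(online_decode(segs,
                    count_words=lambda s: sum(s),
                    emit_word=lambda s, e: f"w{len(e)}"))  # -> ['w0'..'w4']
```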
arXiv Detail & Related papers (2020-06-08T20:49:39Z) - Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
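A minimal NumPy sketch of relative position encoding in self-attention, in the spirit of this scheme: scores depend on the distance j - i through a learned embedding table rather than on absolute positions (a Shaw-style formulation; the projections and table size are illustrative).

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def rel_attention(x, rel_emb, max_dist=8):
    """score[i, j] = q_i . k_j + q_i . a_{j-i}, with clipped distances."""
    T, d = x.shape
    q, k = x, x                                  # identity projections for brevity
    i, j = np.indices((T, T))
    dist = np.clip(j - i, -max_dist, max_dist) + max_dist  # index into table
    pos = np.einsum('id,ijd->ij', q, rel_emb[dist])        # q_i . a_{j-i}
    scores = (q @ k.T + pos) / np.sqrt(d)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
table = rng.standard_normal((2 * 8 + 1, 16)) * 0.1  # distances -8..8
print(rel_attention(rng.standard_normal((10, 16)), table).shape)  # (10, 16)
```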
arXiv Detail & Related papers (2020-05-20T09:53:06Z) - Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
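A hedged sketch of the feedback-memory idea: instead of layer l attending only to layer-l outputs of earlier steps, every layer at step t attends to one shared memory per past step, formed by pooling all layers' representations of that step. Generation is strictly sequential by design; the tanh transform and the pooling weights are illustrative simplifications.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def feedback_transformer(tokens, n_layers=3, seed=0):
    d = tokens.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_layers, d, d)) / np.sqrt(d)
    mix = softmax(rng.standard_normal(n_layers))  # learned layer-pooling weights
    memory, outs = [], []                         # one memory vector per step
    for x in tokens:                              # strictly sequential
        h, states = x, []
        for l in range(n_layers):
            if memory:
                mem = np.stack(memory)                      # (t, d)
                h = h + softmax(h @ mem.T / np.sqrt(d)) @ mem  # attend to memory
            h = np.tanh(h @ W[l])                 # per-layer transform
            states.append(h)
        memory.append(mix @ np.stack(states))     # pooled feedback memory
        outs.append(h)
    return np.stack(outs)

toks = np.random.default_rng(1).standard_normal((5, 16))
print(feedback_transformer(toks).shape)  # (5, 16)
```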
arXiv Detail & Related papers (2020-02-21T16:37:57Z) - Non-Autoregressive Machine Translation with Disentangled Context Transformer [70.95181466892795]
State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens.
We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts.
Our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
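The attention-masking idea is the whole trick here, and a hedged sketch makes it concrete: all positions are predicted in one forward pass, but each position sees its own subset of the other target tokens, enforced purely through a mask. Random 50% contexts are an illustrative assumption, not DisCo's training schedule.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def disco_masks(T, rng):
    """mask[i, j] = True if position i may observe token j (never itself)."""
    mask = rng.random((T, T)) < 0.5         # a random context per position
    np.fill_diagonal(mask, False)           # a token never sees itself
    mask[np.arange(T), (np.arange(T) + 1) % T] = True  # keep contexts non-empty
    return mask

def masked_attention(x, mask):
    scores = np.where(mask, x @ x.T / np.sqrt(x.shape[-1]), -1e9)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
print(masked_attention(x, disco_masks(6, rng)).shape)  # all 6 positions at once
```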
arXiv Detail & Related papers (2020-01-15T05:32:18Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
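Time-restricted self-attention reduces to a banded mask, sketched below in NumPy: each frame attends only to a bounded window of past frames plus a few look-ahead frames, so per-layer latency is fixed. The window sizes are illustrative, and the triggered attention on the decoder side is not reproduced.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def time_restricted_attention(x, left=8, right=2):
    """Self-attention with bounded look-back and look-ahead per frame."""
    T, d = x.shape
    i, j = np.indices((T, T))
    mask = (j >= i - left) & (j <= i + right)   # banded visibility window
    scores = np.where(mask, x @ x.T / np.sqrt(d), -1e9)
    return softmax(scores) @ x

x = np.random.default_rng(0).standard_normal((20, 32))
print(time_restricted_attention(x).shape)  # (20, 32); latency <= 2 frames/layer
```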
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.