Orthros: Non-autoregressive End-to-end Speech Translation with
Dual-decoder
- URL: http://arxiv.org/abs/2010.13047v3
- Date: Thu, 18 Feb 2021 15:01:11 GMT
- Title: Orthros: Non-autoregressive End-to-end Speech Translation with
Dual-decoder
- Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji
Watanabe
- Abstract summary: We propose a novel NAR E2E-ST framework, Orthros, in which NAR and autoregressive (AR) decoders are jointly trained on a shared speech encoder.
The AR decoder is used to select the best translation among candidates of various lengths generated by the NAR decoder, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show that the method improves inference speed while maintaining competitive translation quality.
- Score: 64.55176104620848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fast inference is an important goal for the real-world deployment of
speech translation (ST) systems. End-to-end (E2E) models based on the
encoder-decoder architecture are better suited to this goal than traditional
cascaded systems, but their potential for fast decoding has not yet been
explored. Inspired by recent progress in non-autoregressive (NAR) methods for
text-based translation, which generate target tokens in parallel by
eliminating conditional dependencies, we study the problem of NAR decoding for
E2E-ST. We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and
autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used to select the best translation among candidates of various
lengths generated by the former, which dramatically improves the effectiveness
of a large length beam with negligible overhead. We further
investigate effective length prediction methods from speech inputs and the
impact of vocabulary sizes. Experiments on four benchmarks show that the
proposed method improves inference speed while maintaining translation quality
competitive with state-of-the-art AR E2E-ST systems.
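
To make the decoding scheme concrete, here is a minimal sketch of the length-beam idea: an NAR decoder emits one candidate per target length in a single parallel step, and a lightweight AR decoder rescores the candidates. All modules, names, and shapes below are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of Orthros-style inference: an NAR decoder emits one
# candidate per target length, and a shallow AR decoder rescores them.
import torch

torch.manual_seed(0)
VOCAB, HIDDEN = 100, 32

encoder_out = torch.randn(1, 50, HIDDEN)      # (batch, frames, hidden) from the speech encoder
nar_decoder = torch.nn.Linear(HIDDEN, VOCAB)  # stand-in for the NAR decoder
ar_scorer = torch.nn.Linear(HIDDEN, VOCAB)    # stand-in for the shallow AR decoder
embed = torch.nn.Embedding(VOCAB, HIDDEN)

def nar_generate(length):
    """Emit all `length` tokens in one parallel step (no left-to-right dependency)."""
    ctx = encoder_out.mean(dim=1, keepdim=True).expand(1, length, HIDDEN)
    return nar_decoder(ctx).argmax(dim=-1).squeeze(0)  # (length,)

def ar_score(tokens):
    """Length-normalized sum of AR log-probabilities for one candidate."""
    logp, hidden = 0.0, encoder_out.mean(dim=1)        # toy running state
    for t in tokens:
        logits = ar_scorer(hidden)
        logp += torch.log_softmax(logits, dim=-1)[0, t].item()
        hidden = hidden + embed(t.unsqueeze(0))        # toy state update
    return logp / len(tokens)

predicted_len = 10  # in the paper this comes from a learned length predictor
candidates = [nar_generate(L) for L in range(predicted_len - 2, predicted_len + 3)]
best = max(candidates, key=ar_score)  # the AR decoder selects among length candidates
print(best.tolist())
```

Because scoring a fixed candidate needs no search, the AR pass adds little overhead relative to running a full AR beam search.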
Related papers
- CTC-based Non-autoregressive Textless Speech-to-Speech Translation [38.99922762754443]
Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding.
Recently, some research has turned to non-autoregressive (NAR) models to speed up decoding, yet their translation quality typically lags significantly behind that of autoregressive (AR) models.
In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation.
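
As a concrete illustration of why CTC decoding is non-autoregressive, below is generic greedy CTC decoding (argmax per frame, collapse repeats, drop blanks); this is a textbook sketch, not the cited paper's model.

```python
# Generic greedy CTC decoding: every frame's label is chosen independently
# (hence non-autoregressive), then repeats and blanks are collapsed.
import itertools
import numpy as np

BLANK = 0
rng = np.random.default_rng(0)
log_probs = rng.standard_normal((20, 6))     # (frames, vocab), stand-in encoder output

frame_labels = log_probs.argmax(axis=-1)     # all frames decoded in parallel
collapsed = [k for k, _ in itertools.groupby(frame_labels)]  # merge repeated labels
tokens = [int(k) for k in collapsed if k != BLANK]           # remove blanks
print(tokens)
```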
arXiv Detail & Related papers (2024-06-11T15:00:33Z)
- Bit Cipher -- A Simple yet Powerful Word Representation System that Integrates Efficiently with Language Models [4.807347156077897]
Bit-cipher is a word representation system that eliminates the need for backpropagation and relies on hyper-efficient dimensionality reduction techniques.
We perform probing experiments on part-of-speech (POS) tagging and named entity recognition (NER) to assess bit-cipher's competitiveness with classic embeddings.
Experiments that replace embedding layers with cipher embeddings demonstrate cipher's notable efficiency in accelerating training and attaining better optima.
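
The embedding-swap idea can be shown generically: load fixed, precomputed vectors into a frozen embedding layer so no gradients flow through it. The random vectors below are stand-ins; the actual bit-cipher construction is not reproduced here.

```python
# Replacing a trainable embedding layer with fixed, precomputed vectors.
import torch

VOCAB, DIM = 1000, 64
precomputed = torch.randn(VOCAB, DIM)  # stand-in for precomputed cipher vectors
emb = torch.nn.Embedding.from_pretrained(precomputed, freeze=True)

ids = torch.tensor([[1, 5, 42]])
vectors = emb(ids)                     # (1, 3, 64); frozen, so no gradients flow
print(vectors.shape, vectors.requires_grad)
```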
arXiv Detail & Related papers (2023-11-18T08:47:35Z)
- DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation [36.126810842258706]
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
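
As the title indicates, DASpeech builds on directed-acyclic-graph decoding. The toy sketch below shows greedy decoding over a DAG of decoder states (follow the best forward edge, emit one token per visited vertex); the matrices are random stand-ins, not the trained model.

```python
# Greedy decoding over a DAG of decoder states, in the spirit of directed
# acyclic Transformers; all logits are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
V, VOCAB = 16, 50                           # DAG vertices, vocabulary size
token_logits = rng.standard_normal((V, VOCAB))
trans_logits = rng.standard_normal((V, V))
trans_logits[np.tril_indices(V)] = -np.inf  # keep only forward edges: a DAG

pos, path = 0, [0]
while pos < V - 1:
    pos = int(trans_logits[pos].argmax())   # follow the best outgoing edge
    path.append(pos)
tokens = token_logits[path].argmax(axis=-1) # one token per visited vertex
print(path, tokens.tolist())
```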
arXiv Detail & Related papers (2023-10-11T11:39:36Z)
- Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler Alignment of Embeddings for Asymmetrical dual encoders [89.29256833403169]
We introduce Kullback Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods.
KALE extends traditional Knowledge Distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation.
Using KALE and asymmetric training, we can generate models that exceed the performance of DistilBERT despite having 3x faster inference.
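
A rough sketch of the post-training idea summarized above: freeze the teacher and the document index, and train a smaller query encoder so its query-document score distribution matches the teacher's under a KL objective. Modules and shapes are illustrative assumptions, not KALE's exact recipe.

```python
# Post-hoc query-encoder distillation with a KL objective over retrieval scores.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM, N_DOCS = 32, 8
doc_embs = torch.randn(N_DOCS, DIM)    # frozen document index
teacher_q = torch.nn.Linear(DIM, DIM)  # frozen teacher query encoder
student_q = torch.nn.Linear(DIM, DIM)  # student (same size here only for brevity)
opt = torch.optim.Adam(student_q.parameters(), lr=1e-3)

for step in range(200):
    query = torch.randn(4, DIM)        # a batch of query features
    with torch.no_grad():
        t_scores = teacher_q(query) @ doc_embs.T   # teacher similarity scores
    s_scores = student_q(query) @ doc_embs.T
    loss = F.kl_div(F.log_softmax(s_scores, dim=-1),
                    F.log_softmax(t_scores, dim=-1),
                    reduction="batchmean", log_target=True)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```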
arXiv Detail & Related papers (2023-03-31T15:44:13Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance through subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
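
A toy sketch of the two-pass structure: a first decoder produces subword text states, and a second decoder predicts discrete acoustic units from those states. All modules and shapes are invented stand-ins for illustration.

```python
# Two-pass generation: text decoder first, unit decoder conditioned on its states.
import torch

torch.manual_seed(0)
HID, SUBWORDS, UNITS = 32, 200, 100
enc_out = torch.randn(1, 60, HID)                    # speech encoder states

text_dec = torch.nn.GRU(HID, HID, batch_first=True)  # first-pass (text) decoder
text_head = torch.nn.Linear(HID, SUBWORDS)
unit_dec = torch.nn.GRU(HID, HID, batch_first=True)  # second-pass (unit) decoder
unit_head = torch.nn.Linear(HID, UNITS)

# First pass: decode subword states from pooled speech context (toy inputs).
ctx = enc_out.mean(dim=1, keepdim=True).expand(1, 12, HID).contiguous()
text_states, _ = text_dec(ctx)
subwords = text_head(text_states).argmax(-1)         # (1, 12) subword ids

# Second pass: the unit decoder consumes the first pass's hidden states.
unit_states, _ = unit_dec(text_states)
units = unit_head(unit_states).argmax(-1)            # (1, 12) discrete unit ids
print(subwords.tolist(), units.tolist())
```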
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular approach to large-scale pretraining of language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves seq2seq models by integrating more efficient self-supervised information into the encoders.
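
One generic way to realize encoder-enhanced pretraining is to add a self-supervised reconstruction loss on the encoder side next to the usual seq2seq loss, as sketched below; this illustrates the broad idea, not E2S2's exact objectives.

```python
# Joint objective: seq2seq loss plus an encoder-side masked-reconstruction loss.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, HID = 100, 32
embed = torch.nn.Embedding(VOCAB, HID)
enc_head = torch.nn.Linear(HID, VOCAB)  # reconstructs masked input tokens
dec_head = torch.nn.Linear(HID, VOCAB)  # predicts target tokens

src = torch.randint(1, VOCAB, (4, 10))
tgt = torch.randint(1, VOCAB, (4, 10))
mask = torch.rand(src.shape) < 0.15     # positions to corrupt
corrupted = src.masked_fill(mask, 0)    # id 0 plays the role of [MASK] here

h = embed(corrupted)                    # stand-in "encoder states"
seq2seq_loss = F.cross_entropy(dec_head(h).transpose(1, 2), tgt)
enc_loss = F.cross_entropy(enc_head(h[mask]), src[mask])  # encoder objective
loss = seq2seq_loss + 0.5 * enc_loss    # joint pretraining loss
print(float(loss))
```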
arXiv Detail & Related papers (2022-05-30T08:25:36Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast multi-decoder (MD) model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
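
The described pipeline can be sketched end to end: CTC produces a non-autoregressive transcript whose embeddings serve as hidden intermediates for a translation decoder. Everything below is a toy stand-in for the actual Fast-MD components.

```python
# CTC transcript -> hidden intermediates -> translation, all with toy modules.
import itertools
import torch

torch.manual_seed(0)
HID, SRC_VOCAB, TGT_VOCAB, BLANK = 32, 60, 80, 0
enc_out = torch.randn(1, 40, HID)           # speech encoder states

ctc_head = torch.nn.Linear(HID, SRC_VOCAB)  # CTC branch over the source vocab
src_embed = torch.nn.Embedding(SRC_VOCAB, HID)
mt_head = torch.nn.Linear(HID, TGT_VOCAB)   # stand-in for the MT decoder

# Non-autoregressive transcript: greedy CTC (argmax, collapse, drop blanks).
frame_ids = ctc_head(enc_out).argmax(-1).squeeze(0).tolist()
transcript = [k for k, _ in itertools.groupby(frame_ids) if k != BLANK]

# The transcript embeddings act as hidden intermediates for translation.
hi = src_embed(torch.tensor([transcript]))
translation = mt_head(hi).argmax(-1)
print(transcript, translation.squeeze(0).tolist())
```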
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z)