Adaptive Feature Selection for End-to-End Speech Translation
- URL: http://arxiv.org/abs/2010.08518v2
- Date: Tue, 20 Oct 2020 13:53:39 GMT
- Title: Adaptive Feature Selection for End-to-End Speech Translation
- Authors: Biao Zhang, Ivan Titov, Barry Haddow, Rico Sennrich
- Abstract summary: We propose adaptive feature selection (AFS) for encoder-decoder based E2E speech translation (ST).
We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR.
We take L0DROP as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions.
- Score: 87.07211937607102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Information in speech signals is not evenly distributed, making it an
additional challenge for end-to-end (E2E) speech translation (ST) to learn to
focus on informative features. In this paper, we propose adaptive feature
selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR
encoder and apply AFS to dynamically estimate the importance of each encoded
speech feature to ASR. An ST encoder, stacked on top of the ASR encoder, then
receives the filtered features from the (frozen) ASR encoder. We take L0DROP
(Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech
features with respect to both temporal and feature dimensions. Results on
LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates learning of
ST by pruning out ~84% temporal features, yielding an average translation gain
of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces
the performance gap compared to the cascade baseline, and outperforms it on
LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).
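The core mechanism sketched in the abstract is an L0DROP-style gate: each ASR encoder output receives a stochastic keep/drop score along both the temporal and the feature dimension, and an L0-style penalty pushes most gates to exact zero. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the hard-concrete parameterization, the hyperparameter values, and the per-step/per-channel scorers are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    """Stochastic gates in [0, 1] with exact zeros, trained with an expected-L0 penalty."""

    def __init__(self, beta=0.5, gamma=-0.1, zeta=1.1):  # illustrative values, not from the paper
        super().__init__()
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, log_alpha):
        # log_alpha: logits scoring how important each position/channel is.
        if self.training:
            u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)
            s = torch.sigmoid((torch.log(u) - torch.log1p(-u) + log_alpha) / self.beta)
        else:
            s = torch.sigmoid(log_alpha / self.beta)
        s = s * (self.zeta - self.gamma) + self.gamma  # stretch beyond [0, 1]
        gate = s.clamp(0.0, 1.0)                       # clip -> exact zeros possible
        # Expected L0 cost: probability that each gate is non-zero.
        l0 = torch.sigmoid(log_alpha - self.beta * math.log(-self.gamma / self.zeta))
        return gate, l0.mean()


class AdaptiveFeatureSelection(nn.Module):
    """Gates frozen ASR encoder states along the temporal and feature dimensions."""

    def __init__(self, d_model):
        super().__init__()
        self.time_scorer = nn.Linear(d_model, 1)               # one gate per time step
        self.feat_logits = nn.Parameter(torch.zeros(d_model))  # one gate per channel
        self.gate = HardConcreteGate()

    def forward(self, enc_states):
        # enc_states: (batch, time, d_model) from the pre-trained, frozen ASR encoder.
        t_gate, t_l0 = self.gate(self.time_scorer(enc_states))  # (batch, time, 1)
        f_gate, f_l0 = self.gate(self.feat_logits)              # (d_model,)
        sparsified = enc_states * t_gate * f_gate                # broadcast over batch/time
        # The sparsity penalty (t_l0 + f_l0), scaled by a weight, is added to the ST loss.
        return sparsified, t_l0 + f_l0
```

At test time, time steps whose gate is exactly zero can be dropped before the stacked ST encoder, which is how pruning ~84% of temporal features translates into the reported ~1.4x decoding speedup.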
Related papers
- Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR [74.38242498079627]
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable.
In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems.
arXiv Detail & Related papers (2024-09-13T13:01:09Z)
- Decoder-only Architecture for Streaming End-to-end Speech Recognition [45.161909551392085]
We propose to use a decoder-only architecture for blockwise streaming automatic speech recognition (ASR).
In our approach, speech features are compressed using CTC output and context embedding via a blockwise speech subnetwork, and are sequentially provided as prompts to the decoder.
Our proposed decoder-only streaming ASR achieves 8% relative word error rate reduction in the LibriSpeech test-other set while being twice as fast as the baseline model.
arXiv Detail & Related papers (2024-06-23T13:50:08Z)
- Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation [14.410024368174872]
This paper presents the LS-Transducer-SST, a label-synchronous neural transducer for simultaneous speech translation (SST).
The LS-Transducer-SST dynamically decides when to emit translation tokens based on an Auto-regressive Integrate-and-Fire mechanism.
Experiments on the Fisher-CallHome Spanish (Es-En) and MuST-C En-De data show that the LS-Transducer-SST gives a better quality-latency trade-off than existing popular methods.
arXiv Detail & Related papers (2024-06-06T22:39:43Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Joint Encoder-Decoder Self-Supervised Pre-training for ASR [0.0]
Self-supervised learning has shown tremendous success in various speech-related downstream tasks.
In this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning.
arXiv Detail & Related papers (2022-06-09T12:45:29Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the AISHELL-2 benchmark and two large-scale Mandarin speech corpora of 5,000 and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce; pre-training is promising for end-to-end speech translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.