Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
- URL: http://arxiv.org/abs/2305.03101v1
- Date: Thu, 4 May 2023 18:34:50 GMT
- Title: Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
- Authors: Yun Tang, Anna Y. Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden D. Tomasello and Juan Pino
- Abstract summary: We propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks.
The new method leverages AED's strength in non-monotonic sequence-to-sequence learning while retaining Transducer's streaming property.
We evaluate the proposed approach on the MuST-C dataset and the findings demonstrate that TAED performs significantly better than Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks.
- Score: 28.440232737011453
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transducer and Attention based Encoder-Decoder (AED) are two widely used
frameworks for speech-to-text tasks. They are designed for different purposes
and each has its own benefits and drawbacks for speech-to-text tasks. To
leverage the strengths of both modeling methods, we propose a solution that
combines Transducer and Attention based Encoder-Decoder (TAED) for
speech-to-text tasks. The new method leverages AED's strength in non-monotonic
sequence-to-sequence learning while retaining Transducer's streaming property.
In the proposed framework, Transducer and AED share the same speech encoder.
The predictor in Transducer is replaced by the decoder in the AED model, and
the outputs of the decoder are conditioned on the speech inputs instead of
outputs from an unconditioned language model. The proposed solution ensures
that the model is optimized by covering all possible read/write scenarios and
creates a matched environment for streaming applications. We evaluate the
proposed approach on the MuST-C dataset, and the findings demonstrate that
TAED performs significantly better than Transducer for offline automatic
speech recognition (ASR) and speech-to-text translation (ST) tasks. In the
streaming case, TAED outperforms Transducer in the ASR task and one ST
direction, while comparable results are achieved in the other translation
direction.
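To make the architecture concrete, here is a minimal PyTorch-style sketch of the idea described above: a shared speech encoder feeds both an attention-based decoder, which stands in for the Transducer predictor so that its states are conditioned on the speech input, and a transducer-style joiner over every (time, label) position. All module choices, names, and sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TAEDSketch(nn.Module):
    """Hedged sketch of TAED: an AED decoder replaces the Transducer predictor."""

    def __init__(self, n_mels=80, d_model=256, vocab=1000):
        super().__init__()
        # Speech encoder shared by the Transducer and AED branches.
        self.encoder = nn.LSTM(n_mels, d_model, batch_first=True)
        self.embed = nn.Embedding(vocab, d_model)
        # AED decoder: causal self-attention over labels plus
        # cross-attention to the speech encoder output.
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Transducer-style joiner over (time, label) pairs.
        self.joiner = nn.Linear(2 * d_model, vocab)

    def forward(self, speech, labels):
        # speech: (B, T, n_mels); labels: (B, U) previously emitted tokens.
        enc, _ = self.encoder(speech)                              # (B, T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(labels.size(1))
        # Unlike an unconditioned predictor LM, the decoder states here
        # attend to the speech input via cross-attention.
        dec = self.decoder(self.embed(labels), enc, tgt_mask=mask)  # (B, U, D)
        # Joint lattice over all read/write positions, as in RNN-T.
        t = enc.unsqueeze(2).expand(-1, -1, dec.size(1), -1)     # (B, T, U, D)
        u = dec.unsqueeze(1).expand(-1, enc.size(1), -1, -1)     # (B, T, U, D)
        return self.joiner(torch.cat([t, u], dim=-1))            # (B, T, U, V)
```

Training would pair this (B, T, U, V) lattice with a transducer loss (e.g., torchaudio.functional.rnnt_loss), so that all read/write paths are covered, matching the streaming-friendly optimization the abstract describes.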
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
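As a concrete illustration of discrete speech units, one common recipe, not necessarily DiscreteSLU's exact pipeline, quantizes each continuous encoder frame to the index of its nearest k-means centroid and de-duplicates runs; the sketch below assumes a pre-trained codebook is available.

```python
import torch

def frames_to_units(frames: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """frames: (T, D) encoder outputs; centroids: (K, D) k-means codebook."""
    dists = torch.cdist(frames, centroids)    # (T, K) pairwise distances
    units = dists.argmin(dim=1)               # (T,) nearest-centroid index
    # Collapse consecutive repeats, a common de-duplication step for DSUs.
    keep = torch.ones_like(units, dtype=torch.bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]
```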
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding [29.80299587861207]
We propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR.
Unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively.
We show that ASCD significantly improves performance by leveraging acoustic and semantic information cooperatively.
arXiv Detail & Related papers (2023-05-23T13:25:44Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
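The non-autoregressive step this builds on can be illustrated with greedy CTC decoding, which collapses repeated frame labels and removes blanks in a single parallel pass, so no autoregressive loop is needed to produce the intermediates; the snippet below is a generic sketch, not the paper's code.

```python
import torch

def ctc_greedy(log_probs: torch.Tensor, blank: int = 0) -> list:
    """log_probs: (T, V) per-frame CTC posteriors; the blank id is an assumption."""
    ids = log_probs.argmax(dim=-1)                 # best label per frame
    keep = torch.ones_like(ids, dtype=torch.bool)
    keep[1:] = ids[1:] != ids[:-1]                 # merge repeated labels
    collapsed = ids[keep]
    return collapsed[collapsed != blank].tolist()  # drop blank symbols
```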
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce; pre-training is promising for end-to-end speech translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins by processing the acoustic sequence as usual, but later behaves more like an MT encoder to produce a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
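The time-restricted self-attention mentioned above can be sketched as a banded mask that limits each frame to a bounded window of left and right context, which is what bounds latency in streaming; the window sizes below are illustrative assumptions, not the paper's settings.

```python
import torch

def time_restricted_mask(T: int, left: int = 40, right: int = 4) -> torch.Tensor:
    """Boolean (T, T) mask; True marks pairs that must NOT attend."""
    idx = torch.arange(T)
    rel = idx[None, :] - idx[:, None]      # key index minus query index
    return (rel < -left) | (rel > right)   # outside the allowed band

# Usage with torch.nn.MultiheadAttention (True entries are masked out):
#   attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
#   out, _ = attn(x, x, x, attn_mask=time_restricted_mask(x.size(1)))
```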
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.