UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models
- URL: http://arxiv.org/abs/2402.08898v1
- Date: Wed, 14 Feb 2024 02:11:04 GMT
- Title: UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models
- Authors: Ruchao Fan, Natarajan Balaji Shankar, and Abeer Alwan
- Abstract summary: We propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT.
The proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better than or comparable to CASS-NAT while using only an encoder.
- Score: 23.383924361298874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive automatic speech recognition (NASR) models have gained
attention due to their parallelism and fast inference. The encoder-based NASR,
e.g., connectionist temporal classification (CTC), can be initialized from a
speech foundation model (SFM) but does not account for dependencies among
intermediate tokens. The encoder-decoder-based NASR, like CTC alignment-based
single-step non-autoregressive transformer (CASS-NAT), can mitigate the
dependency problem but cannot efficiently integrate an SFM. Inspired by
the success of recent work on speech-text joint pre-training with a shared
transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to
combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an
encoder as the major module, which can be the SFM. The encoder plays the roles
of both the CASS-NAT encoder and decoder through two forward passes. The first pass
of the encoder accepts the speech signal as input, while the concatenation of
the speech signal and the token-level acoustic embedding is used as the input
for the second pass. Evaluated on the LibriSpeech 100h, MyST, and AISHELL-1
datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results
and is better than or comparable to CASS-NAT while using only an encoder and
hence fewer model parameters. Our code is publicly available.
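
As a concrete illustration of the two-pass scheme described in the abstract, the following is a minimal PyTorch sketch. All module names, sizes, and the greedy-CTC pooling used to form the token-level acoustic embeddings are assumptions made for illustration; the authors' actual implementation is in their public repository.

import torch
import torch.nn as nn

class TwoPassEncoderNASR(nn.Module):
    """Illustrative two-pass, encoder-only NASR (not the authors' code)."""

    def __init__(self, feat_dim=80, d_model=256, nhead=4, nlayers=12,
                 vocab_size=5000, blank=0):
        super().__init__()
        self.blank = blank
        self.proj = nn.Linear(feat_dim, d_model)        # stand-in front-end
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True)
        # A single shared encoder used for both passes; in UniEnc-CASSNAT
        # this module would be initialized from a speech foundation model.
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.ctc_head = nn.Linear(d_model, vocab_size)  # pass-1 CTC logits
        self.out_head = nn.Linear(d_model, vocab_size)  # pass-2 token logits

    def _token_embeddings(self, enc, ctc_logits):
        # Token-level acoustic embeddings from the greedy CTC alignment:
        # average the encoder frames inside each token segment (a very
        # simplified stand-in for the CASS-NAT token acoustic extractor).
        path = ctc_logits.argmax(-1)                    # (B, T)
        pooled = []
        for b in range(enc.size(0)):
            segs, prev = [], self.blank
            for t, p in enumerate(path[b].tolist()):
                if p != self.blank and p != prev:
                    segs.append([t, t + 1])             # a new token starts
                elif p != self.blank:
                    segs[-1][1] = t + 1                 # the token continues
                prev = p
            if not segs:                                # all-blank fallback
                segs = [[0, enc.size(1)]]
            pooled.append(torch.stack([enc[b, s:e].mean(0) for s, e in segs]))
        return nn.utils.rnn.pad_sequence(pooled, batch_first=True)

    def forward(self, feats):                           # feats: (B, T, feat_dim)
        x = self.proj(feats)
        enc1 = self.encoder(x)                          # pass 1: speech only
        ctc_logits = self.ctc_head(enc1)
        tae = self._token_embeddings(enc1, ctc_logits)
        # Pass 2: concatenate the speech input with the token-level
        # embeddings, reuse the same encoder, and read token predictions
        # off the token positions only.
        enc2 = self.encoder(torch.cat([x, tae], dim=1))
        return ctc_logits, self.out_head(enc2[:, x.size(1):])

Under this sketch, a CTC loss on the first-pass logits and a cross-entropy loss on the second-pass token logits would be the natural training signals, mirroring the combination of CTC and CASS-NAT objectives.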
Related papers
- Using Large Language Model for End-to-End Chinese ASR and NER [35.876792804001646]
We present an encoder-decoder architecture that incorporates speech features through cross-attention.
We compare these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks.
Our experiments reveal that the encoder-decoder architecture outperforms the decoder-only architecture when the context is short.
arXiv Detail & Related papers (2024-01-21T03:15:05Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses the limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z) - Joint Encoder-Decoder Self-Supervised Pre-training for ASR [0.0]
Self-supervised learning has shown tremendous success in various speech-related downstream tasks.
In this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning.
arXiv Detail & Related papers (2022-06-09T12:45:29Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by 19.2% relative over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z) - Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk
and Far-Talk Speech Recognition [6.618254914001219]
We propose a dual-encoder ASR architecture for joint modeling of close-talk (CT) and far-talk (FT) speech.
The proposed dual-encoder architecture obtains up to 9% relative WER reduction when using both CT and FT input.
arXiv Detail & Related papers (2021-09-17T19:52:47Z) - Non-autoregressive End-to-end Speech Translation with Parallel
Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z) - Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained
Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce, so pre-training is promising for end-to-end speech translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins by processing the acoustic sequence as usual, but later behaves more like an MT encoder to produce a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z) - Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism; a sketch of the time-restriction mask follows after this list.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)