Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT
- URL: http://arxiv.org/abs/2102.07594v2
- Date: Tue, 16 Feb 2021 01:58:56 GMT
- Title: Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT
- Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai
Zhang
- Abstract summary: We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS)
- Score: 72.93855288283059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based encoder-decoder (AED) models have achieved promising
performance in speech recognition. However, because the decoder predicts text
tokens (such as characters or words) in an autoregressive manner, it is
difficult for an AED model to predict all tokens in parallel. This makes the
inference speed relatively slow. We believe that because the encoder already
captures the whole speech utterance, which implicitly contains the token-level
relationships, we can predict each token without explicit autoregressive
language modeling. When the prediction of a token does not rely on other
tokens, all tokens in the sequence can be predicted in parallel. Based on this
idea, we propose a non-autoregressive speech recognition model called LASO
(Listen Attentively, and Spell Once). The model consists of an encoder, a
decoder, and a position-dependent summarizer (PDS). The three modules are based
on basic attention blocks. The encoder extracts high-level representations from
the speech. The PDS uses positional encodings corresponding to tokens to
convert the acoustic representations into token-level representations. The
decoder further captures token-level relationships with the self-attention
mechanism. Finally, the probability distribution over the vocabulary is
computed for each token position. Speech recognition is therefore reformulated
as a position-wise classification problem. Furthermore, we propose a
cross-modal transfer learning method that refines semantics from the
large-scale pre-trained language model BERT to improve performance.
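Putting the abstract together, below is a minimal PyTorch sketch of a LASO-style forward pass: learned positional queries implement the PDS, an unmasked self-attention stack plays the decoder, and a linear layer gives the position-wise classifier. All module names, sizes, and the cross-modal transfer objective here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LASO(nn.Module):
    """Sketch of Listen Attentively, and Spell Once (illustrative sizes)."""

    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=4,
                 vocab_size=5000, max_tokens=60):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)

        # Encoder: extracts high-level representations from the speech.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

        # PDS: one learned positional query per output token slot attends over
        # the acoustic representations to produce token-level representations.
        self.pos_queries = nn.Parameter(torch.randn(max_tokens, d_model))
        self.pds_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        # Decoder: self-attention over the token-level representations with no
        # causal mask, so every position is computed in parallel.
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=n_layers)

        # Position-wise classification over the vocabulary.
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        h = self.encoder(self.input_proj(feats))   # (B, T, d_model)
        q = self.pos_queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tok, _ = self.pds_attn(q, h, h)            # (B, max_tokens, d_model)
        tok = self.decoder(tok)                    # token-level relationships
        return self.classifier(tok), tok           # logits, hidden states


def cross_modal_kd_loss(dec_states, bert_states, proj):
    """Hypothetical transfer objective: pull the decoder's token-level states
    toward BERT's contextual embeddings of the reference transcript. The
    paper's exact formulation may differ."""
    return F.mse_loss(proj(dec_states), bert_states.detach())
```

At training time one would pair a per-position cross-entropy loss (with reference transcripts padded to max_tokens) with the transfer term; at inference a single parallel forward pass yields all token distributions at once, which is where the speedup over autoregressive decoding comes from.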
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of autoregressive (AR) decoding in scene text recognition (STR), and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of Context Perception Parallel Decoder (CPPD) models and also plug the proposed modules into existing STR decoders.
Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems [31.18865184576272]
This work takes a step towards speech-to-BERT alignment in a much more efficient and fine-grained manner, aligning speech embeddings and BERT embeddings on a token-by-token basis.
We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder.
Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets.
arXiv Detail & Related papers (2022-04-11T15:24:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences arising from its use.