Linguistic-Enhanced Transformer with CTC Embedding for Speech
Recognition
- URL: http://arxiv.org/abs/2210.14725v1
- Date: Tue, 25 Oct 2022 08:12:59 GMT
- Title: Linguistic-Enhanced Transformer with CTC Embedding for Speech
Recognition
- Authors: Xulong Zhang, Jianzong Wang, Ning Cheng, Mengyuan Zhao, Zhiyong Zhang,
Jing Xiao
- Abstract summary: The recent emergence of the joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR).
We propose the linguistic-enhanced transformer, which introduces refined CTC information to the decoder during the training process.
Experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is reduced by up to 7% relative.
- Score: 29.1423215212174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent emergence of the joint CTC-Attention model has brought significant
improvement to automatic speech recognition (ASR). The improvement largely lies
in the modeling of linguistic information by the decoder. The decoder, jointly
optimized with an acoustic encoder, learns a language model from ground-truth
sequences in an auto-regressive manner during training. However, the training
corpus of the decoder is limited to the speech transcriptions, which is far
smaller than the corpus needed to train an acceptable language model. This leads
to poor robustness of the decoder. To alleviate this problem, we propose the
linguistic-enhanced transformer, which introduces refined CTC information to the
decoder during the training process, so that the decoder becomes more robust. Our
experiments on the AISHELL-1 speech corpus show that the character error rate (CER)
is reduced by up to 7% relative. We also find that in the joint CTC-Attention ASR
model, the decoder is more sensitive to linguistic information than to acoustic
information.
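To make the training setup concrete, below is a minimal PyTorch sketch of the two ingredients the abstract refers to: a joint CTC-Attention objective, and a greedy-collapsed CTC hypothesis that could stand in for the "refined CTC information" fed to the decoder. The function names, the 0.3 interpolation weight, and the greedy-collapse refinement are illustrative assumptions, not the authors' exact recipe.

```python
# Illustrative sketch only: the paper's actual refinement of CTC outputs and its
# loss weighting are not specified in the abstract; these are assumed placeholders.
import torch
import torch.nn.functional as F


def ctc_greedy_collapse(enc_logits, blank_id=0):
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks.

    enc_logits: (B, T, V) encoder output logits.
    Returns a list of B variable-length token tensors that could be fed to the
    decoder as (assumed) "refined CTC information" during training.
    """
    paths = enc_logits.argmax(dim=-1)  # (B, T)
    hyps = []
    for path in paths:
        tokens, prev = [], blank_id
        for t in path.tolist():
            if t != prev and t != blank_id:
                tokens.append(t)
            prev = t
        hyps.append(torch.tensor(tokens, dtype=torch.long))
    return hyps


def joint_ctc_attention_loss(enc_logits, dec_logits, ctc_targets, dec_targets,
                             enc_lens, tgt_lens, blank_id=0, ctc_weight=0.3):
    """Interpolate frame-level CTC loss with token-level attention (CE) loss."""
    # CTC branch: encoder frame logits vs. padded label sequences.
    log_probs = enc_logits.log_softmax(dim=-1).transpose(0, 1)  # (T, B, V)
    ctc = F.ctc_loss(log_probs, ctc_targets, enc_lens, tgt_lens,
                     blank=blank_id, zero_infinity=True)
    # Attention branch: decoder token logits vs. targets padded with -1.
    att = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                          dec_targets.reshape(-1), ignore_index=-1)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att


# Toy usage with random tensors: batch of 2, 50 frames, 6 tokens, vocab of 10.
B, T, S, V = 2, 50, 6, 10
enc_logits = torch.randn(B, T, V)
dec_logits = torch.randn(B, S, V)
ctc_targets = torch.randint(1, V, (B, S))
dec_targets = ctc_targets.clone()
enc_lens = torch.full((B,), T, dtype=torch.long)
tgt_lens = torch.full((B,), S, dtype=torch.long)
loss = joint_ctc_attention_loss(enc_logits, dec_logits, ctc_targets,
                                dec_targets, enc_lens, tgt_lens)
```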
Related papers
- RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel speech representation codec for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process can stand on its own or be applied as a low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by 19.2% relative over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers [13.372686722688325]
Training of end-to-end speech recognizers always requires transcribed utterances.
This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
arXiv Detail & Related papers (2022-02-16T07:02:24Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast multi-decoder (MD) model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce; pre-training is promising in end-to-end Speech Translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins by processing the acoustic sequence as usual, but later behaves more like an MT encoder, producing a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z)
- Non-autoregressive Mandarin-English Code-switching Speech Recognition with Pinyin Mask-CTC and Word Embedding Regularization [61.749126838659315]
Mandarin-English code-switching (CS) is frequently used among East and Southeast Asian people.
Recent successful non-autoregressive (NAR) ASR models remove the need for left-to-right beam decoding in autoregressive (AR) models.
We propose changing the Mandarin output target of the encoder to Pinyin for faster encoder training, and introduce a Pinyin-to-Mandarin decoder to learn contextualized information.
arXiv Detail & Related papers (2021-04-06T03:01:09Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.