LongFNT: Long-form Speech Recognition with Factorized Neural Transducer
- URL: http://arxiv.org/abs/2211.09412v1
- Date: Thu, 17 Nov 2022 08:48:27 GMT
- Title: LongFNT: Long-form Speech Recognition with Factorized Neural Transducer
- Authors: Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian
- Abstract summary: We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
- Score: 64.75547712366784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional automatic speech recognition (ASR) systems usually focus on
individual utterances and do not consider long-form speech with useful
historical information, which is more practical in real scenarios. Simply
attending to a longer transcription history in a vanilla neural transducer model
shows little gain in our preliminary experiments, since the prediction network
is not a pure language model. This motivates us to leverage the factorized
neural transducer structure, which contains a real language model, the vocabulary
predictor. We propose the LongFNT-Text architecture, which fuses the
sentence-level long-form features directly with the output of the vocabulary
predictor and then embeds token-level long-form features inside the vocabulary
predictor, with a pre-trained contextual encoder, RoBERTa, to further boost the
performance. Moreover, we propose the LongFNT architecture by extending the
long-form speech to the original speech input, which achieves the best performance.
The effectiveness of our LongFNT approach is validated on LibriSpeech and
GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction,
respectively.
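The "relative WER reduction" figures quoted above are ratios against a baseline WER, not absolute percentage-point differences. A minimal sketch of that arithmetic follows; the function name and the input values are hypothetical and chosen only for illustration, they are not results reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), for illustration only.
baseline = 10.0
improved = 8.1

# An absolute drop of 1.9 points from a 10.0 baseline is a 19% relative reduction.
print(f"{relative_wer_reduction(baseline, improved):.0%}")
```

The same absolute drop would be a smaller relative reduction against a higher baseline, which is why relative figures are only comparable when the baselines are stated.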
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models [23.21666928497697]
The improved adaptation ability of Factorized neural transducer (FNT) on text-only adaptation data came at the cost of lowered accuracy compared to the standard neural transducer model.
A combination of these approaches results in a relative word-error-rate reduction of 9.48% from the standard FNT model.
arXiv Detail & Related papers (2022-12-05T02:52:21Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers [13.372686722688325]
Training of end-to-end speech recognizers always requires transcribed utterances.
This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
arXiv Detail & Related papers (2022-02-16T07:02:24Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection [8.602181445598776]
We propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection.
Our experiments show that our proposed pipeline boosts the performance significantly, achieving a new state-of-the-art on the Vietnamese Hate Speech Detection campaign with a 0.7221 F1 score.
arXiv Detail & Related papers (2021-02-24T09:30:55Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)