Rnn-transducer with language bias for end-to-end Mandarin-English
code-switching speech recognition
- URL: http://arxiv.org/abs/2002.08126v1
- Date: Wed, 19 Feb 2020 12:01:33 GMT
- Title: Rnn-transducer with language bias for end-to-end Mandarin-English
code-switching speech recognition
- Authors: Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jianhua Tao, Ye Bai
- Abstract summary: We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the problem.
We use the language identities to bias the model to predict the CS points.
This promotes the model to learn the language identity information directly from transcription, and no additional LID model is needed.
- Score: 58.105818353866354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, language identity information has been utilized to improve the
performance of end-to-end code-switching (CS) speech recognition. However,
previous works use an additional language identification (LID) model as an
auxiliary module, which causes the system complex. In this work, we propose an
improved recurrent neural network transducer (RNN-T) model with language bias
to alleviate the problem. We use the language identities to bias the model to
predict the CS points. This promotes the model to learn the language identity
information directly from transcription, and no additional LID model is needed.
We evaluate the approach on a Mandarin-English CS corpus SEAME. Compared to our
RNN-T baseline, the proposed method can achieve 16.2% and 12.9% relative error
reduction on two test sets, respectively.
Related papers
- Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation [40.0365339798752]
Many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language.
In some cases, the input language can be given or estimated.
We accomplish this by introducing a simple and effective linear input network.
arXiv Detail & Related papers (2024-06-12T00:00:39Z) - Multilingual self-supervised speech representations improve the speech
recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data finetuning self-supervised representations is a better performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z) - Generative error correction for code-switching speech recognition using
large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z) - Learning Cross-lingual Mappings for Data Augmentation to Improve
Low-Resource Speech Recognition [31.575930914290762]
Exploiting cross-lingual resources is an effective way to compensate for data scarcity of low resource languages.
We extend the concept of learnable cross-lingual mappings for end-to-end speech recognition.
The results show that any source language ASR model can be used for a low-resource target language recognition.
arXiv Detail & Related papers (2023-06-14T15:24:31Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and
Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Is Attention always needed? A Case Study on Language Identification from
Speech [1.162918464251504]
The present study introduces convolutional recurrent neural network (CRNN) based LID.
CRNN based LID is designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) characteristics of audio samples.
The LID model exhibits high-performance levels ranging from 97% to 100% for languages that are linguistically similar.
arXiv Detail & Related papers (2021-10-05T16:38:57Z) - Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z) - Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance in language recognition, where parts of speech data are weighted more based on their relevance for the language recognition task.
Experiments are performed using the language recognition task in NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data.
arXiv Detail & Related papers (2020-04-02T18:31:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.