Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition
- URL: http://arxiv.org/abs/2205.11998v2
- Date: Wed, 25 May 2022 02:36:27 GMT
- Title: Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition
- Authors: Yuting Yang, Binbin Du, Yuke Li
- Abstract summary: In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation.
We present a novel method involving multi-level modeling units, which integrates multi-level information for Mandarin speech recognition.
- Score: 9.930655347717932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The choice of modeling units affects the performance of the acoustic modeling
and plays an important role in automatic speech recognition (ASR). In Mandarin
scenarios, Chinese characters represent meaning but are not directly related to
pronunciation, so considering only the written form of Chinese characters as
modeling units is insufficient to capture speech features. In this paper, we
present a novel method involving multi-level modeling units, which integrates
multi-level information for Mandarin speech recognition.
Specifically, the encoder block considers syllables as modeling units, and the
decoder block deals with character modeling units. During inference, the input
feature sequences are converted into syllable sequences by the encoder block
and then converted into Chinese characters by the decoder block. This process
is conducted by a unified end-to-end model without introducing additional
conversion models. By introducing an InterCE auxiliary task, our method achieves
competitive results with CER of 4.1%/4.6% and 4.6%/5.2% on the widely used
AISHELL-1 benchmark without a language model, using the Conformer and the
Transformer backbones respectively.
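The following is a minimal PyTorch sketch of the multi-level layout described in the abstract: the encoder is supervised with syllable-level units (e.g. via CTC), while the attention decoder predicts characters. This is an illustration, not the authors' implementation: a plain Transformer stands in for the paper's Conformer/Transformer backbone, the vocabulary sizes are assumptions, and the InterCE auxiliary loss is omitted.

```python
# Minimal sketch (not the authors' code): syllable units on the encoder side,
# character units on the decoder side. Positional encodings, masks, and the
# InterCE auxiliary loss are omitted for brevity; all sizes are assumptions.
import torch
import torch.nn as nn

N_SYLLABLES = 1500   # assumed Mandarin syllable inventory (+ CTC blank)
N_CHARS = 4233       # assumed character inventory

class MultiLevelASR(nn.Module):
    def __init__(self, feat_dim=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 6)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 6)
        self.syllable_head = nn.Linear(d_model, N_SYLLABLES)  # encoder-side units
        self.char_emb = nn.Embedding(N_CHARS, d_model)
        self.char_head = nn.Linear(d_model, N_CHARS)          # decoder-side units

    def forward(self, feats, char_in):
        enc = self.encoder(self.proj(feats))
        syl_logits = self.syllable_head(enc)      # train with CTC on syllables
        dec = self.decoder(self.char_emb(char_in), enc)
        char_logits = self.char_head(dec)         # train with CE on characters
        return syl_logits, char_logits

model = MultiLevelASR()
syl, char = model(torch.randn(2, 120, 80), torch.randint(0, N_CHARS, (2, 10)))
```

At inference, the same unified model maps input features to syllable posteriors in the encoder and to Chinese characters in the decoder, without a separate conversion model.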
Related papers
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample (sketched below).
Our approach was declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
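A toy sketch of that audio-textual idea: transcribe the clip with any ASR system, embed both modalities, and fuse them. All three helper functions here are hypothetical placeholders, not the paper's models.

```python
# Toy sketch (hypothetical helpers, not the paper's models): fuse an audio
# embedding with a text embedding of the ASR transcript of the same clip.
import numpy as np

def asr_transcribe(waveform: np.ndarray) -> str:
    return "placeholder transcript"      # stand-in for any ASR model

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    return np.zeros(512)                 # stand-in for an audio encoder

def embed_text(text: str) -> np.ndarray:
    return np.zeros(768)                 # stand-in for a text encoder

def audio_textual_representation(waveform: np.ndarray) -> np.ndarray:
    transcript = asr_transcribe(waveform)
    # One multimodal vector per sample, as described in the abstract.
    return np.concatenate([embed_audio(waveform), embed_text(transcript)])

rep = audio_textual_representation(np.zeros(16000))   # shape: (1280,)
```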
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated (sketched below).
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
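A minimal sketch of that concatenation augmentation, under the assumption that waveforms and label sequences from utterances in different languages are simply joined end to end (the sampling strategy and any further processing are not specified here):

```python
# Minimal sketch (assumed details): build an artificial code-switched training
# example by concatenating audio and labels of two monolingual utterances.
import numpy as np

def concat_augment(utt_a, utt_b):
    """utt_a, utt_b: (waveform, token_list) pairs from different languages."""
    wav_a, labels_a = utt_a
    wav_b, labels_b = utt_b
    # Join waveforms along time and label sequences in the same order.
    return np.concatenate([wav_a, wav_b]), labels_a + labels_b

mixed_wav, mixed_labels = concat_augment(
    (np.zeros(16000), ["hello", "world"]),   # English utterance (toy data)
    (np.zeros(16000), ["你", "好"]),          # Mandarin utterance (toy data)
)
```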
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition [38.60303603000269]
We propose to use a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems.
The proposed encoding is a combination of a pronunciation-based syllable and a character index (CI); a toy illustration follows.
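The following toy illustration shows one plausible reading of such an encoding: homophonic characters share a syllable and are disambiguated by a per-syllable index. The tiny character table and the unit format are assumptions for illustration only.

```python
# Toy illustration (assumed format): combine a syllable with a character
# index (CI) so every character gets a unique, pronunciation-aware unit.
chars_by_syllable = {
    "shi4": ["是", "事", "市"],   # homophones sharing one syllable (toy table)
    "ma1":  ["妈"],
}

# Units like "shi4_0", "shi4_1" map characters to unique tokens and back.
char2unit = {
    ch: f"{syl}_{idx}"
    for syl, chars in chars_by_syllable.items()
    for idx, ch in enumerate(chars)
}
unit2char = {unit: ch for ch, unit in char2unit.items()}

assert char2unit["事"] == "shi4_1" and unit2char["shi4_1"] == "事"
```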
arXiv Detail & Related papers (2022-07-29T09:49:10Z)
- Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech Recognition with Pinyin and Character [15.999657143705045]
Pinyin and characters, as the spelling and writing systems of Mandarin Chinese respectively, mutually reinforce each other.
We propose a novel Mandarin Chinese ASR model with a dual-decoder Transformer designed around the characteristics of pinyin transcripts and character transcripts (a structural sketch follows).
Results on the AISHELL-1 test set show that the proposed Speech-Pinyin-Character-Interaction (SPCI) model achieves a 9.85% character error rate (CER) without a language model.
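A structural sketch of a dual-decoder layout, assuming one shared encoder feeding two parallel decoders, one over pinyin units and one over characters; dimensions, vocabulary sizes, and how the two decoders interact in the actual SPCI model are assumptions here.

```python
# Structural sketch (assumed sizes; decoder interaction omitted): one shared
# speech encoder with separate pinyin and character Transformer decoders.
import torch.nn as nn

class DualDecoderASR(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_pinyin=1300, n_char=4233):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 6)
        self.pinyin_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 3)
        self.char_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 3)
        self.pinyin_emb = nn.Embedding(n_pinyin, d_model)
        self.char_emb = nn.Embedding(n_char, d_model)
        self.pinyin_out = nn.Linear(d_model, n_pinyin)
        self.char_out = nn.Linear(d_model, n_char)

    def forward(self, feats, pinyin_in, char_in):
        enc = self.encoder(self.proj(feats))
        pinyin_logits = self.pinyin_out(
            self.pinyin_dec(self.pinyin_emb(pinyin_in), enc))
        char_logits = self.char_out(
            self.char_dec(self.char_emb(char_in), enc))
        return pinyin_logits, char_logits  # two transcript views, trained jointly
```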
arXiv Detail & Related papers (2022-01-26T07:59:03Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations (a minimal VQ sketch follows).
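A minimal sketch of the vector-quantization step used for content encoding: each frame embedding is snapped to its nearest codebook entry, yielding discrete content units. The mutual-information penalty between content, speaker, and pitch representations is not shown, and all sizes are assumptions.

```python
# Minimal VQ sketch (assumed sizes; the MI-based penalty is not shown): snap
# each frame embedding to its nearest codebook entry to get content units.
import torch

def vector_quantize(frames: torch.Tensor, codebook: torch.Tensor):
    """frames: (T, D) continuous features; codebook: (K, D) learned codes."""
    dists = torch.cdist(frames, codebook)   # (T, K) pairwise distances
    indices = dists.argmin(dim=1)           # (T,) discrete content units
    return codebook[indices], indices       # quantized frames + unit ids

quantized, ids = vector_quantize(torch.randn(100, 64), torch.randn(320, 64))
```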
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Generative Spoken Language Modeling from Raw Audio [42.153136032037175]
Generative spoken language modeling involves learning jointly the acoustic and linguistic characteristics of a language from raw audio only (without text or labels).
We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks.
We test baseline systems consisting of a discrete speech encoder (returning discrete, low-bitrate pseudo-text units), a generative language model (trained on pseudo-text units), and a speech decoder (a pipeline skeleton follows).
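A pipeline skeleton of that textless setup, with all three components as hypothetical placeholders (the actual systems use, e.g., quantized self-supervised features for the encoder, a unit language model, and a unit-to-waveform decoder):

```python
# Pipeline skeleton (all components are placeholders, not the paper's models):
# encode speech into pseudo-text units, continue the sequence with a unit
# language model, then decode units back to a waveform.
from typing import List

def speech_to_units(waveform: List[float]) -> List[int]:
    return [3, 3, 17, 42]               # stand-in for a discrete speech encoder

def continue_units(prompt: List[int], n: int) -> List[int]:
    return prompt + [42] * n            # stand-in for a generative unit LM

def units_to_speech(units: List[int]) -> List[float]:
    return [0.0] * (160 * len(units))   # stand-in for a speech decoder

prompt = speech_to_units([0.0] * 16000)                  # pseudo-text prompt
audio = units_to_speech(continue_units(prompt, n=20))    # speech continuation
```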
arXiv Detail & Related papers (2021-02-01T21:41:40Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well when trained on IPA transcriptions of the languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate acoustic model (AM) and language model (LM).
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.