Multi-pass Training and Cross-information Fusion for Low-resource
End-to-end Accented Speech Recognition
- URL: http://arxiv.org/abs/2306.11309v1
- Date: Tue, 20 Jun 2023 06:08:09 GMT
- Title: Multi-pass Training and Cross-information Fusion for Low-resource
End-to-end Accented Speech Recognition
- Authors: Xuefei Wang, Yanhua Long, Yijie Li, Haoran Wei
- Abstract summary: Low-resource accented speech recognition is one of the important challenges faced by current ASR technology.
We propose a Conformer-based architecture, called Aformer, to leverage acoustic information from both large non-accented and limited accented training data.
- Score: 12.323309756880581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-resource accented speech recognition is one of the important challenges
faced by current ASR technology in practical applications. In this study, we
propose a Conformer-based architecture, called Aformer, to leverage acoustic
information from both large non-accented and limited accented training
data. Specifically, a general encoder and an accent encoder are designed in the
Aformer to extract complementary acoustic information. Moreover, we propose to
train the Aformer in a multi-pass manner, and investigate three
cross-information fusion methods to effectively combine the information from
both general and accent encoders. All experiments are conducted on both the
accented English and Mandarin ASR tasks. Results show that our proposed methods
outperform the strong Conformer baseline with relative word/character error rate
reductions of 10.2% to 24.5% on six in-domain and out-of-domain accented
test sets.
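
The abstract outlines a dual-encoder design (a general encoder plus an accent encoder) whose outputs are combined by a cross-information fusion step before decoding. Below is a minimal PyTorch sketch of that idea, assuming plain Transformer blocks in place of Conformer blocks and cross-attention as the fusion method; the class names, layer sizes, and the CTC-style output head are illustrative assumptions, not the Aformer's actual configuration, and the multi-pass training schedule is not shown.

```python
# Illustrative sketch only: dual encoders + cross-attention fusion.
# Not the paper's exact Aformer; dimensions and layers are assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse accent-encoder features into general-encoder features."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, general_feats, accent_feats):
        # Queries from the general branch; keys/values from the accent branch.
        fused, _ = self.attn(general_feats, accent_feats, accent_feats)
        return self.norm(general_feats + fused)


class DualEncoderASR(nn.Module):
    """Toy dual-encoder acoustic model with cross-information fusion."""

    def __init__(self, feat_dim: int = 80, d_model: int = 256, vocab: int = 5000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=1024, batch_first=True
        )
        # Stand-ins for Conformer blocks: a "general" and an "accent" encoder.
        self.general_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.accent_encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.fusion = CrossAttentionFusion(d_model)
        self.ctc_head = nn.Linear(d_model, vocab)  # e.g. a CTC output layer

    def forward(self, feats):
        x = self.proj(feats)                  # (B, T, d_model)
        general = self.general_encoder(x)     # complementary acoustic views
        accent = self.accent_encoder(x)
        fused = self.fusion(general, accent)  # cross-information fusion
        return self.ctc_head(fused)           # per-frame token logits


if __name__ == "__main__":
    dummy = torch.randn(2, 100, 80)           # batch of 2, 100 frames, 80-dim fbank
    logits = DualEncoderASR()(dummy)
    print(logits.shape)                       # torch.Size([2, 100, 5000])
```

Using the general branch as queries and the accent branch as keys/values is only one plausible way to combine the two streams; the paper itself investigates three cross-information fusion methods.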
Related papers
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
We propose an accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z)
- Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experiment results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation learning of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Personalizing Keyword Spotting with Speaker Information [11.4457776449367]
Keyword spotting systems often struggle to generalize to a diverse population with various accents and age groups.
We propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM).
Our proposed approach requires only a small 1% increase in the number of parameters, with minimal impact on latency and computational cost.
arXiv Detail & Related papers (2023-11-06T12:16:06Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.