INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced
Non-Native Speech Recognition
- URL: http://arxiv.org/abs/2305.16371v1
- Date: Thu, 25 May 2023 13:06:01 GMT
- Title: INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced
Non-Native Speech Recognition
- Authors: Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson and
Chang D. Yoo
- Abstract summary: We propose Information Theoretic Adversarial Prompt Tuning (INTapt) to mitigate representational bias in automatic speech recognition systems.
INTapt is trained simultaneously in the following two manners: (1) adversarial training to reduce accent feature dependence between the original input and the prompt-concatenated input, and (2) training to minimize CTC loss to improve ASR performance on the prompt-concatenated input.
Experimental results show that INTapt improves ASR performance on L2 English and increases feature similarity between L2 and L1 accents.
- Score: 43.228070238684786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems have attained unprecedented
performance with large speech models pre-trained based on self-supervised
speech representation learning. However, these pre-trained speech models suffer
from representational bias as they tend to better represent those prominent
accents (i.e., native (L1) English accent) in the pre-training speech corpus
than less represented accents, resulting in a deteriorated performance for
non-native (L2) English accents. Although there have been some approaches to
mitigate this issue, all of these methods require updating the pre-trained
model weights. In this paper, we propose Information Theoretic Adversarial
Prompt Tuning (INTapt), which introduces prompts concatenated to the original
input that can re-modulate the attention of the pre-trained model such that the
corresponding input resembles native (L1) English speech without updating the
backbone weights. INTapt is trained simultaneously in the following two
manners: (1) adversarial training to reduce accent feature dependence between
the original input and the prompt-concatenated input, and (2) training to
minimize CTC loss to improve ASR performance on the prompt-concatenated input.
Experimental results show that INTapt improves ASR performance on L2 English
and increases feature similarity between L2 and L1 accents.
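The two-part objective described above can be made concrete with a small, hedged sketch. It assumes a frozen wav2vec2-style CTC backbone from HuggingFace Transformers and stands in for the paper's information-theoretic criterion with a simple adversarial accent critic; the prompt length, critic architecture, per-utterance accent embedding, and loss weight `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2ForCTC

# Frozen pre-trained CTC backbone: only the prompt (and the accent critic) are trained.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
for p in model.parameters():
    p.requires_grad = False

PROMPT_LEN = 20                          # assumed prompt length, not from the paper
dim = model.config.hidden_size           # 768 for the base model
prompt = torch.nn.Parameter(0.02 * torch.randn(1, PROMPT_LEN, dim))

# Assumed adversary: tries to recover an accent embedding from the prompted
# representation; the prompt is trained to make that recovery fail, acting as a
# proxy for reducing accent-feature dependence between input and prompted input.
critic = torch.nn.Sequential(torch.nn.Linear(dim, 256), torch.nn.ReLU(),
                             torch.nn.Linear(256, dim))

def prompted_forward(input_values):
    """Concatenate the learned prompt to the frame-level features and run the
    frozen encoder, so the prompt can re-modulate the backbone's attention."""
    feats = model.wav2vec2.feature_extractor(input_values).transpose(1, 2)
    feats, _ = model.wav2vec2.feature_projection(feats)
    feats = torch.cat([prompt.expand(feats.size(0), -1, -1), feats], dim=1)
    hidden = model.wav2vec2.encoder(feats).last_hidden_state[:, PROMPT_LEN:]
    return hidden, model.lm_head(hidden)

def intapt_losses(input_values, labels, accent_emb, lam=0.1):
    """CTC loss on the prompted input plus an adversarial accent-recovery term."""
    hidden, logits = prompted_forward(input_values)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)        # (T, B, vocab)
    in_lens = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    tgt_lens = (labels != -100).sum(-1)                              # -100 pads ignored
    ctc = F.ctc_loss(log_probs, labels.clamp(min=0), in_lens, tgt_lens, blank=0)
    recovery = F.mse_loss(critic(hidden.mean(dim=1)), accent_emb)
    return ctc - lam * recovery, recovery   # prompt minimizes the first, critic the second
```

In an actual loop the prompt and the critic would be updated in alternation (or via a gradient-reversal layer), mirroring the adversarial scheme the abstract describes, while the backbone weights stay untouched.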
Related papers
- Unveiling the Role of Pretraining in Direct Speech Translation [14.584351239812394]
We compare the training dynamics of a system using a pretrained encoder (the conventional approach) with those of one trained from scratch.
We observe that, throughout training, the randomly initialized model struggles to incorporate information from the speech inputs into its predictions.
We propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training.
arXiv Detail & Related papers (2024-09-26T16:46:46Z)
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
We propose an accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks; a minimal sketch of this mechanism follows this entry.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
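As referenced above, here is a minimal sketch of cross-attention over a trainable codebook. The codebook size, dimensionality, and residual fusion are illustrative assumptions rather than the exact design of the cited approach.

```python
import torch

class CodebookCrossAttention(torch.nn.Module):
    """Illustrative adaptation layer: encoder frames attend over a small set of
    trainable accent codebook vectors and fuse the result back residually."""
    def __init__(self, dim=512, n_codes=64, n_heads=8):
        super().__init__()
        self.codebook = torch.nn.Parameter(0.02 * torch.randn(n_codes, dim))
        self.attn = torch.nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frames):                        # frames: (batch, time, dim)
        codes = self.codebook.unsqueeze(0).expand(frames.size(0), -1, -1)
        attended, _ = self.attn(query=frames, key=codes, value=codes)
        return frames + attended                      # accent-informed residual update

# Dummy usage: drop the layer after an ASR encoder block.
frames = torch.randn(4, 200, 512)
adapted = CodebookCrossAttention()(frames)            # (4, 200, 512)
```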
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) network that devotes its main training parameters to multiple cross-modal attention layers; a minimal cross-modal attention sketch follows this entry.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
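As referenced above, a short sketch of audio-guided cross-modal attention, in which audio frames act as queries over visual lip features. The layer sizes and single-block structure are illustrative assumptions, not the authors' CMFE.

```python
import torch

class AudioGuidedFusion(torch.nn.Module):
    """Illustrative fusion block: the audio stream queries visual lip features and
    the attended visual context is added back to the audio representation."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, audio, video):            # audio: (B, Ta, D), video: (B, Tv, D)
        ctx, _ = self.attn(query=audio, key=video, value=video)
        return self.norm(audio + ctx)            # audio enriched with visual context

# Dummy usage with mismatched audio/video frame rates.
audio = torch.randn(2, 200, 256)                 # e.g., 100 Hz acoustic frames
video = torch.randn(2, 50, 256)                  # e.g., 25 fps lip features
fused = AudioGuidedFusion()(audio, video)        # (2, 200, 256)
```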
- Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition [18.154258453839066]
We improve an accent-conversion model (ACM) which transforms native US-English speech into accented pronunciation.
We include phonetic knowledge in the ACM training to provide accurate feedback about how well certain pronunciation patterns were recovered in the synthesized waveform.
We evaluate our approach on native and non-native English datasets and find that synthetically accented data helps the ASR system better understand speech from seen accents.
arXiv Detail & Related papers (2023-03-01T20:05:19Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate that the proposed method learns effective disentangled speech representations; a minimal vector-quantization sketch follows this entry.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
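As referenced above, a minimal sketch of a vector-quantization bottleneck for content encoding. The codebook size, straight-through estimator, and commitment term are standard VQ ingredients used here for illustration; the mutual-information penalty between content, speaker, and pitch branches is only indicated in a comment, not implemented.

```python
import torch

class VectorQuantizer(torch.nn.Module):
    """Illustrative VQ bottleneck: snap each content frame to its nearest codebook
    entry, using the straight-through estimator so gradients reach the encoder."""
    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.codebook = torch.nn.Parameter(0.02 * torch.randn(n_codes, dim))

    def forward(self, z):                              # z: (batch, time, dim)
        dists = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dists.argmin(dim=-1)                     # nearest code per frame
        q = self.codebook[idx]
        commit = torch.mean((z - q.detach()) ** 2)     # commitment loss term
        q = z + (q - z).detach()                       # straight-through estimator
        return q, idx, commit

# Dummy usage: quantized content features would feed the decoder, while speaker and
# pitch branches stay continuous and an MI-style penalty (not shown) discourages
# information overlap between the branches.
z = torch.randn(2, 100, 64)
content, codes, commit_loss = VectorQuantizer()(z)     # (2, 100, 64), (2, 100), scalar
```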
This list is automatically generated from the titles and abstracts of the papers on this site.