Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
- URL: http://arxiv.org/abs/2302.08102v2
- Date: Fri, 18 Oct 2024 09:58:45 GMT
- Title: Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
- Authors: Minsu Kim, Hyung-Il Kim, Yong Man Ro,
- Abstract summary: We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR)
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
- Score: 66.94463981654216
- License:
- Abstract: Visual Speech Recognition (VSR) aims to infer speech into text depending on lip movements alone. As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements, and this makes the VSR models show degraded performance when they are applied to unseen speakers. In this paper, to remedy the performance degradation of the VSR model on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Different from the previous prompt tuning methods mainly limited to Transformer variant architecture, we explore different types of prompts, the addition, the padding, and the concatenation form prompts that can be applied to the VSR model which is composed of CNN and Transformer in general. With the proposed prompt tuning, we show that the performance of the pre-trained VSR model on unseen speakers can be largely improved by using a small amount of adaptation data (e.g., less than 5 minutes), even if the pre-trained model is already developed with large speaker variations. Moreover, by analyzing the performance and parameters of different types of prompts, we investigate when the prompt tuning is preferred over the finetuning methods. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.
Related papers
- Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation [10.014408469561834]
Dysarthric speech recognition (DSR) presents a formidable challenge due to inherent inter-speaker variability.
Traditional speaker adaptation methodologies typically involve fine-tuning models for each speaker.
We introduce a prototype-based approach that markedly improves DSR performance for unseen dysarthric speakers without additional fine-tuning.
arXiv Detail & Related papers (2024-07-26T02:03:23Z) - Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z) - Extending Whisper with prompt tuning to target-speaker ASR [18.31992429200396]
Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from overlapped utterances.
Most of the existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model.
This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR.
arXiv Detail & Related papers (2023-12-13T11:49:16Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Lip2Vec: Efficient and Robust Visual Speech Recognition via
Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset achieving 26 WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z) - Continual Learning for On-Device Speech Recognition using Disentangled
Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z) - Direction-Aware Adaptive Online Neural Speech Enhancement with an
Augmented Reality Headset in Real Noisy Conversational Environments [21.493664174262737]
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset.
It helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party)
The method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker.
arXiv Detail & Related papers (2022-07-15T05:14:27Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM)
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual
Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.