Multimodal Attention Merging for Improved Speech Recognition and Audio
Event Classification
- URL: http://arxiv.org/abs/2312.14378v2
- Date: Fri, 9 Feb 2024 15:48:23 GMT
- Title: Multimodal Attention Merging for Improved Speech Recognition and Audio
Event Classification
- Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh,
Venkatesh Ravichandran, Phani Sankar Nidadavolu
- Abstract summary: Multimodal Attention Merging (MAM) transfers knowledge from the attention matrices of text and image models to speech and audio models in a zero-shot setting.
MAM reduces the Word Error Rate (WER) of an Automatic Speech Recognition (ASR) model by up to 6.70% relative.
Learnable-MAM, a data-driven approach to merging attention matrices, yields a further 2.90% relative reduction in WER for ASR and an 18.42% relative reduction in AEC error.
- Score: 20.206229252251717
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large foundation models using self-supervised objectives on
unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a
standard procedure. Unfortunately, the efficacy of this approach is often
constrained by both limited fine-tuning compute and the scarcity of labeled
downstream data. We introduce Multimodal Attention Merging (MAM), an approach
that facilitates direct knowledge transfer from the attention matrices of
models rooted in high-resource modalities (text and images) to those in
resource-constrained domains (speech and audio) under a zero-shot paradigm.
MAM reduces the Word Error Rate (WER) of an Automatic Speech Recognition (ASR)
model by up to 6.70% relative, and the classification error of an Audio Event
Classification (AEC) model by 10.63% relative. In cases where some data and
compute are available, we present Learnable-MAM, a data-driven approach to
merging attention matrices, which yields a further 2.90% relative reduction in
WER for ASR and an 18.42% relative reduction in AEC error compared to
fine-tuning.
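As a rough illustration of the zero-shot merging idea, the sketch below blends the attention-projection weights of a speech-model layer with those of a text-model layer using a fixed mixing coefficient. The abstract does not specify the exact merging rule, layer pairing, or coefficient value, so the convex combination, the `mix` parameter, the toy nn.MultiheadAttention layers, and the trainable per-layer coefficient suggested for Learnable-MAM are assumptions for illustration, not the paper's exact method.

```python
# Minimal sketch of zero-shot Multimodal Attention Merging (MAM), assuming the
# merge is a convex combination of attention parameters. The merging rule,
# layer pairing, and coefficient below are illustrative assumptions only.
import torch
import torch.nn as nn


def merge_attention_weights(speech_attn: nn.MultiheadAttention,
                            text_attn: nn.MultiheadAttention,
                            mix: float = 0.1) -> None:
    """Blend text-model attention parameters into the speech model in place:
    W_speech <- (1 - mix) * W_speech + mix * W_text."""
    with torch.no_grad():
        for p_s, p_t in zip(speech_attn.parameters(), text_attn.parameters()):
            if p_s.shape == p_t.shape:  # merge only shape-compatible tensors
                p_s.mul_(1.0 - mix).add_(mix * p_t)


# Toy stand-ins for one attention layer of each model (same hidden size).
speech_layer = nn.MultiheadAttention(embed_dim=256, num_heads=4)
text_layer = nn.MultiheadAttention(embed_dim=256, num_heads=4)

# Zero-shot MAM: no training, just a fixed mixing coefficient.
merge_attention_weights(speech_layer, text_layer, mix=0.1)

# Learnable-MAM (data-driven variant, assumed form): make the coefficient a
# trainable parameter per layer and fit it on the small labeled downstream set.
mix_param = nn.Parameter(torch.tensor(0.1))  # optimized with the task loss
```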
Related papers
- Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation [13.009945735929445]
We propose a novel paradigm to address salient problems in the Automatic Speech Recognition field.
In the first stage, multiple acoustic models are trained on different subsets of the complete speech data.
In the second stage, two novel algorithms are utilized to generate a high-quality acoustic model.
arXiv Detail & Related papers (2024-10-21T03:48:23Z)
- Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GANs' inherent flaws and biased optimization within the latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z)
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model, we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for the optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models [2.4654745083407175]
We propose a new multi-round adaptation process that uses uncertainty to automate the annotation process.
This novel method streamlines data annotation and strategically selects data samples contributing most to model uncertainty.
Our results show that our approach leads to a 27% WER relative average improvement while requiring on average 45% less data than established baselines.
arXiv Detail & Related papers (2023-06-03T13:11:37Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
- A Unified Model for Multi-class Anomaly Detection [33.534990722449066]
UniAD accomplishes anomaly detection for multiple classes with a unified framework.
We evaluate our algorithm on MVTec-AD and CIFAR-10 datasets.
arXiv Detail & Related papers (2022-06-08T06:05:09Z)
- Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso [102.84915019938413]
Non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG) offer promise for imaging brain activity.
The problem of source localization, or source imaging, poses however a high-dimensional statistical inference challenge.
We propose an ensemble of desparsified multi-task Lasso (ecd-MTLasso) to deal with this problem.
arXiv Detail & Related papers (2020-09-29T21:17:16Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
- Attention based on-device streaming speech recognition with large speech corpus [16.702653972113023]
We present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with large (> 10K hours) corpus.
We attained around a 90% word recognition rate for the general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross-entropy (CE) losses.
For on-demand adaptation, we fused the MoChA models with statistical n-gram models, achieving a 36% relative improvement on average in word error rate (WER) for target domains, including the general domain.
arXiv Detail & Related papers (2020-01-02T04:24:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.