Ask2Mask: Guided Data Selection for Masked Speech Modeling
- URL: http://arxiv.org/abs/2202.12719v1
- Date: Thu, 24 Feb 2022 17:34:54 GMT
- Title: Ask2Mask: Guided Data Selection for Masked Speech Modeling
- Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu
Zhang and Pedro Moreno
- Abstract summary: Masked speech modeling (MSM) methods learn representations over speech frames which are randomly masked within an utterance.
They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant information to learn meaningful representations.
We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training.
- Score: 25.716834361963468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn
representations over speech frames which are randomly masked within an
utterance. While these methods improve performance of Automatic Speech
Recognition (ASR) systems, they have one major limitation. They treat all
unsupervised speech samples with equal weight, which hinders learning as not
all samples have relevant information to learn meaningful representations. In
this work, we address this limitation. We propose ask2mask (ATM), a novel
approach to focus on specific samples during MSM pre-training. ATM employs an
external ASR model or scorer to weight unsupervised input samples in
two different ways: 1) A fine-grained data selection is performed by masking
over the highly confident input frames as chosen by the scorer. This allows the
model to learn meaningful representations. 2) ATM is further extended to focus
at utterance-level by weighting the final MSM loss with the utterance-level
confidence score. We conduct fine-tuning experiments on two well-benchmarked
corpora: LibriSpeech (matching the pre-training data) and Commonvoice,
TED-LIUM, AMI and CHiME-6 (not matching the pre-training data). The results
substantiate the efficacy of ATM on significantly improving the recognition
performance under mismatched conditions (up to 11.6% relative over published
results and up to 4.46% relative over our internal baseline) while still
yielding modest improvements under matched conditions.
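The two weighting mechanisms above can be illustrated with a minimal sketch. The code below assumes per-frame and per-utterance confidence scores are already available from an external ASR scorer; the function names, the 0.4 mask ratio, the top-half candidate pool, and the loss normalization are illustrative assumptions, not the paper's exact masking schedule or loss formulation.

```python
import numpy as np

def atm_select_mask_frames(frame_confidences, mask_ratio=0.4, rng=None):
    """Choose which frames of one utterance to mask for MSM pre-training.

    frame_confidences: per-frame confidence scores from an external ASR
    scorer (hypothetical interface). Masking is restricted to the scorer's
    most confident frames, in place of uniform random masking.
    """
    rng = rng or np.random.default_rng()
    conf = np.asarray(frame_confidences, dtype=float)
    num_frames = conf.shape[0]
    num_masked = max(1, int(mask_ratio * num_frames))

    # Rank frames from most to least confident and keep the high-confidence
    # frames as masking candidates (here: the top half, an assumed heuristic).
    order = np.argsort(conf)[::-1]
    pool = order[: max(num_masked, num_frames // 2)]

    # Draw the actual mask positions from the high-confidence pool.
    return np.sort(rng.choice(pool, size=num_masked, replace=False))

def atm_utterance_weighted_loss(msm_losses, utterance_confidences):
    """Scale each utterance's MSM loss by its utterance-level confidence.

    The weighted average below is one possible normalization.
    """
    losses = np.asarray(msm_losses, dtype=float)
    weights = np.asarray(utterance_confidences, dtype=float)
    return float((weights * losses).sum() / weights.sum())
```

For example, a 10-frame utterance with mask_ratio=0.4 gets four mask positions drawn from its five most confident frames, and that utterance's MSM loss is then scaled by the scorer's utterance-level confidence before being averaged over the batch.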
Related papers
- DM-Codec: Distilling Multimodal Representations for Speech Tokenization [11.433520275513803]
DM-Codec is a language model-guided distillation method that incorporates contextual information.
It significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset.
arXiv Detail & Related papers (2024-10-19T07:14:14Z)
- Introducing Model Inversion Attacks on Automatic Speaker Recognition [0.9558392439655015]
Model inversion (MI) attacks make it possible to reconstruct average per-class representations of a machine learning (ML) model's training data.
We present an approach to (1) reconstruct audio samples from a trained ML model and (2) extract intermediate voice feature representations which provide valuable insights into the speakers' biometrics.
Our sliding MI extends standard MI by iteratively inverting overlapping chunks of the audio samples.
We show that one can use the inverted audio data to generate spoofed audio samples to impersonate a speaker, and execute voice-protected commands for highly secured systems.
arXiv Detail & Related papers (2023-01-09T08:51:15Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence Margin [92.76372026435858]
We learn an Adaptive Confidence Margin (Ada-CM) to fully leverage all unlabeled data for semi-supervised deep facial expression recognition.
All unlabeled samples are partitioned into two subsets by comparing their confidence scores with the adaptively learned confidence margin.
Our method achieves state-of-the-art performance, especially surpassing fully-supervised baselines in a semi-supervised manner.
arXiv Detail & Related papers (2022-03-23T11:43:29Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [49.47516627019855]
w2v-BERT is a framework that combines contrastive learning and masked language modeling for self-supervised speech pre-training.
Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models.
arXiv Detail & Related papers (2021-08-07T06:29:36Z)
- Meta Auxiliary Learning for Facial Action Unit Detection [84.22521265124806]
We consider learning AU detection and facial expression recognition in a multi-task manner.
The performance of the AU detection task cannot always be enhanced, due to negative transfer in the multi-task scenario.
We propose a Meta Auxiliary Learning method (MAL) that automatically selects highly related FE samples by learning adaptive weights for the training FE samples in a meta-learning manner.
arXiv Detail & Related papers (2021-05-14T02:28:40Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation (a generic mixup sketch follows this list).
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
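As referenced in the MixSpeech entry above, mixup interpolates two training examples and their targets with a shared random weight. The sketch below is a generic application of mixup to ASR input features, not code from the MixSpeech paper; the Beta(0.5, 0.5) parameter and the equal-shape padding assumption are illustrative.

```python
import numpy as np

def mixup_features(features_a, features_b, alpha=0.5, rng=None):
    """Blend two feature matrices (e.g. padded log-mel spectrograms of equal
    shape) with a Beta-sampled weight lambda."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    mixed = lam * np.asarray(features_a) + (1.0 - lam) * np.asarray(features_b)
    return mixed, lam

def mixup_loss(loss_a, loss_b, lam):
    """Interpolate the losses computed against each utterance's labels,
    using the same lambda that mixed the inputs."""
    return lam * loss_a + (1.0 - lam) * loss_b
```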
This list is automatically generated from the titles and abstracts of the papers on this site.