FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised
Learning Features in Robust End-to-end Speech Recognition
- URL: http://arxiv.org/abs/2206.15056v1
- Date: Thu, 30 Jun 2022 06:39:40 GMT
- Title: FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised
Learning Features in Robust End-to-end Speech Recognition
- Authors: Szu-Jui Chen, Jiamin Xie, John H.L. Hansen
- Abstract summary: We propose to investigate effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models.
We show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
- Score: 34.40924909515384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning representations (SSLR) have resulted in robust
features for downstream tasks in many fields. Recently, several SSLRs have
shown promising results on automatic speech recognition (ASR) benchmark
corpora. However, previous studies have only shown performance for solitary
SSLRs as an input feature for ASR models. In this study, we propose to
investigate the effectiveness of diverse SSLR combinations using various fusion
methods within end-to-end (E2E) ASR models. In addition, we will show there are
correlations between these extracted SSLRs. As such, we further propose a
feature refinement loss for decorrelation to efficiently combine the set of
input features. For evaluation, we show that the proposed 'FeaRLESS learning
features' perform better than systems without the proposed feature refinement
loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
Related papers
- Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification [2.552131151698595]
We proposed a novel self-supervision and supervision combining transformer-based person re-identification framework, namely SSSC-TransReID.
We designed a self-supervised contrastive learning branch, which can enhance the feature representation for person re-identification without negative samples or additional pre-training.
Our proposed model obtains superior Re-ID performance consistently and outperforms the state-of-the-art ReID methods by large margins on the mean average accuracy (mAP) and Rank-1 accuracy.
arXiv Detail & Related papers (2024-10-21T03:17:25Z) - Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs)
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z) - A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which ecoded by embedding vectors, resulting in excessively large embedding tables.
Despite the prosperity of lightweight embedding-based RSs, a wide diversity is seen in evaluation protocols.
This study investigates various LERS' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z) - Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features [32.765965044767356]
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems.
This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models.
arXiv Detail & Related papers (2024-05-02T11:48:30Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Learning Common Rationale to Improve Self-Supervised Representation for
Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z) - Investigation of Ensemble features of Self-Supervised Pretrained Models
for Automatic Speech Recognition [0.3007949058551534]
Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks.
This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pretrained models.
arXiv Detail & Related papers (2022-06-11T12:43:00Z) - ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS)
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.