FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised
Learning Features in Robust End-to-end Speech Recognition
- URL: http://arxiv.org/abs/2206.15056v1
- Date: Thu, 30 Jun 2022 06:39:40 GMT
- Title: FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised
Learning Features in Robust End-to-end Speech Recognition
- Authors: Szu-Jui Chen, Jiamin Xie, John H.L. Hansen
- Abstract summary: We propose to investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models.
We show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
- Score: 34.40924909515384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning representations (SSLR) have resulted in robust
features for downstream tasks in many fields. Recently, several SSLRs have
shown promising results on automatic speech recognition (ASR) benchmark
corpora. However, previous studies have only shown performance for solitary
SSLRs as an input feature for ASR models. In this study, we propose to
investigate the effectiveness of diverse SSLR combinations using various fusion
methods within end-to-end (E2E) ASR models. In addition, we show that there are
correlations between these extracted SSLRs. As such, we further propose a
feature refinement loss for decorrelation to efficiently combine the set of
input features. For evaluation, we show that the proposed 'FeaRLESS learning
features' perform better than systems without the proposed feature refinement
loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
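The abstract does not give the exact form of the feature refinement loss, but a decorrelation objective between two SSLR feature streams can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it standardizes each stream per dimension, forms their cross-correlation matrix, and penalizes its squared entries so that the combined features carry complementary rather than redundant information.

```python
import numpy as np

def decorrelation_loss(feat_a, feat_b, eps=1e-8):
    """Hypothetical decorrelation penalty between two SSLR feature streams.

    feat_a, feat_b: (T, D) arrays of frame-level features from two
    self-supervised models. Returns the mean squared entry of their
    cross-correlation matrix, which approaches 0 when the standardized
    streams are uncorrelated.
    """
    # Standardize each feature dimension over the time axis
    a = (feat_a - feat_a.mean(0)) / (feat_a.std(0) + eps)
    b = (feat_b - feat_b.mean(0)) / (feat_b.std(0) + eps)
    # (D x D) cross-correlation matrix between the two streams
    corr = a.T @ b / a.shape[0]
    # Penalize all correlations; minimizing this encourages the
    # streams to encode complementary information
    return float(np.mean(corr ** 2))
```

In training, a term like this would be added to the E2E ASR objective with a weighting coefficient; identical streams yield a high penalty, while independent streams yield a penalty near zero.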
Related papers
- A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables.
Despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in evaluation protocols.
This study investigates various LERS' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z) - Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features [32.765965044767356]
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems.
This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models.
arXiv Detail & Related papers (2024-05-02T11:48:30Z) - Exploiting Self-Supervised Constraints in Image Super-Resolution [72.35265021054471]
This paper introduces a novel self-supervised constraint for single image super-resolution, termed SSC-SR.
SSC-SR uniquely addresses the divergence in image complexity by employing a dual asymmetric paradigm and a target model updated via exponential moving average to enhance stability.
Empirical evaluations reveal that our SSC-SR framework delivers substantial enhancements on a variety of benchmark datasets, achieving an average increase of 0.1 dB over EDSR and 0.06 dB over SwinIR.
arXiv Detail & Related papers (2024-03-30T06:18:50Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Learning Common Rationale to Improve Self-Supervised Representation for
Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z) - Investigation of Ensemble features of Self-Supervised Pretrained Models
for Automatic Speech Recognition [0.3007949058551534]
Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks.
This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pretrained models.
arXiv Detail & Related papers (2022-06-11T12:43:00Z) - ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
In the cascade approach, ASR errors directly affect the quality of the output summary.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
arXiv Detail & Related papers (2021-04-09T11:04:58Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.