Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models
- URL: http://arxiv.org/abs/2406.10507v1
- Date: Sat, 15 Jun 2024 05:13:19 GMT
- Title: Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models
- Authors: Ruchao Fan, Natarajan Balaji Shankar, Abeer Alwan
- Abstract summary: Speech foundation models (SFMs) have achieved state-of-the-art results for various speech tasks in both supervised (e.g., Whisper) and self-supervised (e.g., WavLM) settings.
- Score: 23.383924361298874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech foundation models (SFMs) have achieved state-of-the-art results for various speech tasks in both supervised (e.g., Whisper) and self-supervised (e.g., WavLM) settings. However, the performance of SFMs for child ASR has not been systematically studied. In addition, there is no benchmark for child ASR with standardized evaluations, which makes comparing novel ideas difficult. In this paper, we initiate and present a comprehensive benchmark on several child speech databases based on various SFMs (Whisper, Wav2vec2.0, HuBERT, and WavLM). Moreover, we investigate finetuning strategies by comparing various data augmentation and parameter-efficient finetuning (PEFT) methods. We observe that the behaviors of these methods differ as model size increases; for example, PEFT matches the performance of full finetuning for large models but performs worse for small models. To stabilize finetuning with augmented data, we propose a perturbation invariant finetuning (PIF) loss as a regularization.
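The abstract does not spell out the PIF formulation. As a rough, non-authoritative sketch of the general idea, one way to regularize finetuning on augmented data is to combine a standard ASR loss on the perturbed speech with a consistency term that keeps the model's output distributions on the clean and perturbed versions of the same utterance close. Everything below (the function name, the CTC-based ASR loss, the KL consistency term, and the weight lambda_pif) is an illustrative assumption, not the paper's actual loss.

```python
# Hypothetical sketch of a perturbation-invariant finetuning (PIF) style loss.
# This is NOT the paper's exact formulation; it only illustrates regularizing
# finetuning on augmented data with a clean/perturbed consistency term.
import torch.nn.functional as F


def pif_ctc_loss(model, clean_feats, perturbed_feats, targets,
                 input_lengths, target_lengths, lambda_pif: float = 0.1):
    """CTC loss on the perturbed view + KL consistency with the clean view.

    Assumes `model(feats)` returns per-frame logits of shape (B, T, V) and that
    the perturbation is length-preserving (e.g., additive noise), so both views
    have the same number of frames.
    """
    logp_clean = F.log_softmax(model(clean_feats), dim=-1)      # (B, T, V)
    logp_pert = F.log_softmax(model(perturbed_feats), dim=-1)   # (B, T, V)

    # Standard CTC loss computed on the augmented (perturbed) view.
    ctc = F.ctc_loss(logp_pert.transpose(0, 1),  # (T, B, V) as CTC expects
                     targets, input_lengths, target_lengths)

    # Consistency term: keep the perturbed-view distribution close to the
    # (detached) clean-view distribution, frame by frame.
    consistency = F.kl_div(logp_pert, logp_clean.detach(),
                           reduction="batchmean", log_target=True)

    return ctc + lambda_pif * consistency
```

In practice, the weight on the consistency term and the choice of which view serves as the reference would need tuning, and the paper's PIF loss may differ in both respects.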
Related papers
- Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR [13.307889110301502]
We compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based models for TTS, when used for ASR model training.
We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models.
We achieve the best reported ratio between real and synthetic speech WER to date (1.46), but also find that a large gap remains.
arXiv Detail & Related papers (2024-10-16T06:35:56Z)
- Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models [19.752712857873043]
This paper introduces Selective Self-Rehearsal (SSR), a fine-tuning approach that achieves performance comparable to standard supervised fine-tuning (SFT).
By utilizing the model's correct responses, SSR reduces model specialization during the fine-tuning stage.
The effectiveness of SSR is demonstrated through experiments on the task of identifying unanswerable queries across various datasets.
arXiv Detail & Related papers (2024-09-07T10:21:03Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [75.25114727856861]
Large language models (LLMs) tend to suffer from performance deterioration at the latter stage of the supervised fine-tuning (SFT) process.
We introduce a simple disperse-then-merge framework to address the issue.
Our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks.
arXiv Detail & Related papers (2024-05-22T08:18:19Z)
- Comparative Analysis of Different Efficient Fine Tuning Methods of Large Language Models (LLMs) in Low-Resource Setting [0.0]
We aim to further the understanding of different fine-tuning strategies for large language models (LLMs).
We compare state-of-the-art methods like vanilla fine-tuning and Pattern-Based Fine-Tuning (PBFT) on pre-trained models across two datasets, COLA and MNLI.
Our findings suggest that these alternative strategies can exhibit out-of-domain generalization comparable to that of vanilla FT and PBFT.
arXiv Detail & Related papers (2024-05-21T20:08:52Z)
- BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
- Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences [7.592727209806414]
Several ASR models exist in various sizes, with different inference costs leading to different performance levels.
We propose to train a decision module that, given an audio sample, selects the smallest model sufficient to produce a good transcription.
Because the decision process itself is computationally cheap, the resulting system achieves substantial computational savings with only a small drop in performance (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2023-09-22T08:50:58Z)
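The entry above describes sample-dependent model selection only at a high level. The sketch below is an illustrative confidence-threshold router, not the learned decision module trained in that paper; the ASRResult interface, the avg_logprob confidence score, and the threshold value are all assumptions.

```python
# Illustrative only: a confidence-threshold router for choosing between a small
# and a large ASR model per audio sample. The cited paper trains a dedicated
# decision module instead; this sketch just shows the routing idea.
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ASRResult:
    text: str
    avg_logprob: float  # mean per-token log-probability from the decoder (assumed)


def route_transcription(
    audio,                                   # one audio sample, e.g. a waveform array
    small_model: Callable[..., ASRResult],   # cheap ASR model (assumed interface)
    large_model: Callable[..., ASRResult],   # expensive ASR model (assumed interface)
    threshold: float = -0.5,                 # would be tuned on held-out data
) -> Tuple[str, str]:
    """Return (transcript, which_model): use the small model when it looks confident."""
    small_out = small_model(audio)
    if small_out.avg_logprob >= threshold:
        return small_out.text, "small"
    # Low confidence: spend the extra compute only on hard audios.
    return large_model(audio).text, "large"
```

A learned decision module, as that paper proposes, would replace the fixed threshold with a small classifier trained to predict when the small model's transcript is already good enough.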
- MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models [90.99663022952498]
SUPERB was proposed to evaluate the generalizability of self-supervised learning (SSL) speech models across various tasks.
However, SUPERB incurs high computational costs due to its large datasets and diverse tasks.
We introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models, achieving results comparable to SUPERB at significantly lower computational cost.
arXiv Detail & Related papers (2023-05-30T13:07:33Z)
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
- A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems [25.468558523679363]
We train and compare gender-specific wav2vec 2.0 models against models containing different degrees of gender balance in pre-training data.
We observe lower overall performance using gender-specific pre-training before fine-tuning an end-to-end ASR system.
arXiv Detail & Related papers (2022-04-04T11:28:19Z)
- From Sound Representation to Model Robustness [82.21746840893658]
We investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
Averaged over various experiments on three environmental sound datasets, we found the ResNet-18 model outperforms other deep learning architectures.
arXiv Detail & Related papers (2020-07-27T17:30:49Z)