Adapting an ASR Foundation Model for Spoken Language Assessment
- URL: http://arxiv.org/abs/2307.09378v1
- Date: Thu, 13 Jul 2023 16:01:58 GMT
- Title: Adapting an ASR Foundation Model for Spoken Language Assessment
- Authors: Rao Ma, Mengjie Qian, Mark J. F. Gales, Kate M. Knill
- Abstract summary: A crucial part of an accurate and reliable spoken language assessment system is the underlying ASR model.
Recently, large-scale pre-trained ASR foundation models such as Whisper have been made available.
These models have a tendency to skip disfluencies and hesitations in the output.
For assessment, a precise transcription of what a candidate said is needed.
- Score: 40.402050390096456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A crucial part of an accurate and reliable spoken language assessment system
is the underlying ASR model. Recently, large-scale pre-trained ASR foundation
models such as Whisper have been made available. As the output of these models
is designed to be human readable, punctuation is added, numbers are presented
in Arabic numeric form and abbreviations are included. Additionally, these
models have a tendency to skip disfluencies and hesitations in the output.
Though useful for readability, these attributes are not helpful for assessing
the ability of a candidate and providing feedback. Here a precise transcription
of what a candidate said is needed. In this paper, we give a detailed analysis
of Whisper outputs and propose two solutions: fine-tuning and soft prompt
tuning. Experiments are conducted on both public speech corpora and an English
learner dataset. Results show that we can effectively alter the decoding
behaviour of Whisper to generate the exact words spoken in the response.
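The abstract names two adaptation routes, fine-tuning and soft prompt tuning, without implementation detail. As a rough illustration of the first route only, below is a minimal fine-tuning sketch using the Hugging Face transformers Whisper classes; the checkpoint name, placeholder audio, and hyperparameters are assumptions for illustration, not the authors' setup.

```python
# Minimal sketch (not the authors' code): nudging Whisper toward verbatim
# output by fine-tuning on (audio, verbatim transcript) pairs.
# Requires numpy, torch, transformers; checkpoint, dummy audio, and
# hyperparameters below are illustrative placeholders.
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.train()

# Placeholder 16 kHz audio and a verbatim-style reference that keeps the
# hesitations and repetitions Whisper's default decoding tends to drop.
audio = np.zeros(16000 * 5, dtype=np.float32)  # 5 s of silence as a stand-in
verbatim_text = "um i think er the answer is is twenty five"

features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
labels = processor.tokenizer(verbatim_text, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_features=features, labels=labels).loss  # seq2seq cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

For the soft prompt tuning route, the model weights would instead stay frozen while a small set of learnable embedding vectors prepended to the decoder input is trained; that variant is not sketched here.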
Related papers
- Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages [0.43498389175652036]
This study integrates traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages.
We demonstrate substantial improvements in word error rate, particularly in low-resource scenarios.
While the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters.
arXiv Detail & Related papers (2025-03-30T18:03:52Z)
- QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions [45.34333059156364]
We introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset.
We also propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models.
arXiv Detail & Related papers (2025-03-26T07:32:20Z)
- Whispering in Amharic: Fine-tuning Whisper for Low-resource Language [3.2858851789879595]
This work explores fine-tuning OpenAI's Whisper automatic speech recognition model for Amharic.
We fine-tune it using datasets like Mozilla Common Voice, FLEURS, and the BDU-speech dataset.
The best-performing model, Whispersmall-am, significantly improves when finetuned on a mix of existing FLEURS data and new, unseen Amharic datasets.
arXiv Detail & Related papers (2025-03-24T09:39:41Z)
- Whisper Finetuning on Nepali Language [0.0]
This research focuses on making an exhaustive and generalized dataset followed by fine-tuning OpenAI's Whisper models to improve transcription accuracy for the Nepali language.
We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles further enriched through augmentation.
Our approach outperforms Whisper's baseline models trained on the FLEURS dataset, achieving WER reductions of up to 36.2% on the small model and 23.8% on the medium model.
arXiv Detail & Related papers (2024-11-19T15:55:56Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- WER We Stand: Benchmarking Urdu ASR Models [3.5001789247699535]
This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models.
We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T, using Word Error Rate (WER); a minimal sketch of how WER is computed appears after this list.
We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset.
arXiv Detail & Related papers (2024-09-17T15:00:31Z)
- A Suite for Acoustic Language Model Evaluation [20.802090523583196]
We introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response.
We evaluate several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method.
arXiv Detail & Related papers (2024-09-11T17:34:52Z)
- LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models [58.790604613878216]
We introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models.
The benchmark includes a LibriSpeech-PC dataset with restored punctuation and capitalization, a novel evaluation metric called Punctuation Error Rate (PER) that focuses on punctuation marks, and initial baseline models.
arXiv Detail & Related papers (2023-10-04T16:23:37Z)
- Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text [4.863310073296471]
We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions.
We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
arXiv Detail & Related papers (2023-06-06T10:18:17Z)
- Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps applied iteratively, the first of which conditions the language model on the input, an initial LM output, and feedback to generate refinements.
We show theoretically that ILF can be viewed as Bayesian inference, similar to Reinforcement Learning from Human Feedback (RLHF).
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo-label-based semi-supervised training strategy that uses a language model within an end-to-end speech sentiment analysis approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
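Several of the related papers above report gains in Word Error Rate (WER). Purely as a reference point, here is a minimal sketch of how WER is typically computed with the jiwer package; the reference and hypothesis strings are invented examples, not data from any paper listed here.

```python
# Minimal WER sketch (pip install jiwer); the strings are made-up examples.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # 2 substitutions over 9 reference words, about 22.22%
```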