Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition
- URL: http://arxiv.org/abs/2505.17538v2
- Date: Thu, 14 Aug 2025 12:51:04 GMT
- Title: Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition
- Authors: Leonora Vesterbacka, Faton Rekathati, Robin Kurtz, Justyna Sikora, Agnes Toftgård,
- Abstract summary: This work presents a suite of fine-tuned Whisper models for Swedish, trained on a dataset of unprecedented size and variability for this mid-resourced language.<n>We report an overall improvement across model sizes compared to OpenAI's Whisper evaluated on Swedish.
- Score: 1.1417805445492082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents a suite of fine-tuned Whisper models for Swedish, trained on a dataset of unprecedented size and variability for this mid-resourced language. As languages of smaller sizes are often underrepresented in multilingual training datasets, substantial improvements in performance can be achieved by fine-tuning existing multilingual models, as shown in this work. This work reports an overall improvement across model sizes compared to OpenAI's Whisper evaluated on Swedish. Most notably, we report an average 47% reduction in WER comparing our best performing model to OpenAI's whisper-large-v3, in evaluations across FLEURS, Common Voice, and NST.
Related papers
- Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla [0.0]
This study investigates the performances of two state-of-the-art Automatic Speech Recognition (ASR) models, OpenAI's Whisper (Small & Large-V2) and Facebook's Wav2Vec-BERT on Bangla, a low-resource language.
arXiv Detail & Related papers (2025-07-02T17:44:54Z) - Efficient Multilingual ASR Finetuning via LoRA Language Experts [59.27778147311189]
This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper.<n>Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods.<n> Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance gains in language-aware and language-agnostic scenarios.
arXiv Detail & Related papers (2025-06-11T07:06:27Z) - Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages [0.43498389175652036]
This study integrates traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages.<n>We demonstrate substantial improvements in word error rate, particularly in low-resource scenarios.<n>While the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters.
arXiv Detail & Related papers (2025-03-30T18:03:52Z) - Whisper Finetuning on Nepali Language [0.0]
This research focuses on making an exhaustive and generalized dataset followed by fine-tuning OpenAI's Whisper models to improve transcription accuracy for the Nepali language.
We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles further enriched through augmentation.
Our approach outperforms Whisper's baseline models trained on Fleur's dataset, achieving WER reductions of up to 36.2% on the small and 23.8% on medium models.
arXiv Detail & Related papers (2024-11-19T15:55:56Z) - An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - Evaluating Large Language Models with Human Feedback: Establishing a Swedish Benchmark [0.0]
Large language models (LLMs) have demonstrated significant capabilities across numerous applications.
This study introduces a comprehensive human benchmark to assess the efficacy of prominent LLMs in understanding and generating Swedish language texts.
arXiv Detail & Related papers (2024-05-22T21:22:51Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - Hearing voices at the National Library -- a speech corpus and acoustic
model for the Swedish language [0.0]
We develop new acoustic models for automated speech recognition (ASR) at the National Library of Sweden (KB)
We evaluate different approaches for a viable speech-to-text pipeline for audiovisual resources in Swedish, using the wav2vec 2.0 architecture.
We conclude by highlighting the potential of such technology for cultural heritage institutions with vast collections of previously unlabelled audiovisual data.
arXiv Detail & Related papers (2022-05-06T06:06:00Z) - Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is equally important to using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z) - Large-Scale Self- and Semi-Supervised Learning for Speech Translation [48.06478781295623]
We explore both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with CommonCrawl.
Our experiments improve over the previous state of the art by 2.6 BLEU on average on all four considered CoVoST 2 language pairs.
arXiv Detail & Related papers (2021-04-14T07:44:52Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.