Self-supervised learning of speech representations with Dutch archival data
- URL: http://arxiv.org/abs/2507.04554v2
- Date: Tue, 08 Jul 2025 12:27:54 GMT
- Title: Self-supervised learning of speech representations with Dutch archival data
- Authors: Nik Vaessen, Roeland Ordelman, David A. van Leeuwen
- Abstract summary: We show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. We convert the noisy broadcast dataset into a high-quality dataset for pre-training by using Whisper and WhisperX. Finally, we achieve a state-of-the-art large wav2vec 2.0 model for the Dutch language by continuing pre-training of a wav2vec 2.0 XLS-R model checkpoint with our 55k-hour archival dataset.
- Score: 8.504327926435158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. Secondly, we explore effective pre-processing strategies to convert the noisy broadcast dataset into a high-quality dataset for pre-training, by using Whisper and WhisperX. Thirdly, we compare mono-lingual and multi-lingual pre-training with equivalent amounts of data, and show that mono-lingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language by continuing the pre-training of a wav2vec 2.0 XLS-R model checkpoint with our 55k-hour archival dataset.
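The two most concrete techniques in the abstract can be sketched in code. First, the pre-processing step: below is a minimal, hypothetical illustration of using WhisperX to segment a noisy broadcast recording and keep only Dutch speech for pre-training. The file names, the 2-second minimum duration, and the "large-v2" checkpoint are illustrative assumptions, not the paper's reported configuration.

```python
# Hypothetical pre-processing sketch (not the paper's exact pipeline):
# segment noisy broadcast audio with WhisperX and keep Dutch speech only.
import os

import soundfile as sf
import whisperx

SAMPLE_RATE = 16_000  # whisperx.load_audio resamples everything to 16 kHz

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("broadcast_episode.wav")  # illustrative file name
result = model.transcribe(audio, batch_size=16)

os.makedirs("pretrain_segments", exist_ok=True)
if result["language"] == "nl":  # Whisper's language ID: keep Dutch content
    for i, seg in enumerate(result["segments"]):
        duration = seg["end"] - seg["start"]
        # Drop very short fragments; wav2vec 2.0 pre-training crops
        # fixed-length windows, so they contribute little.
        if duration >= 2.0:  # assumed threshold
            start = int(seg["start"] * SAMPLE_RATE)
            end = int(seg["end"] * SAMPLE_RATE)
            sf.write(f"pretrain_segments/seg_{i:05d}.wav",
                     audio[start:end], SAMPLE_RATE)
```

Second, the continuation of pre-training from an XLS-R checkpoint. The paper does not state its training framework or hyperparameters, so the sketch below, built on the Hugging Face transformers pre-training API and the public facebook/wav2vec2-xls-r-300m checkpoint, is only an assumption about how such a continuation could look.

```python
# Hypothetical sketch of continuing wav2vec 2.0 pre-training from XLS-R;
# hyperparameters and the dummy batch are illustrative only.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

checkpoint = "facebook/wav2vec2-xls-r-300m"  # public XLS-R release
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForPreTraining.from_pretrained(checkpoint)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One training step on a dummy 10 s waveform; a real run would stream
# batches of the filtered archival segments instead.
waveform = torch.randn(16_000 * 10).numpy()
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
batch_size, raw_len = inputs.input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# The contrastive objective needs masked time steps and sampled negatives.
mask = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.65, mask_length=10)
negatives = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask,
)

outputs = model(
    inputs.input_values,
    mask_time_indices=torch.from_numpy(mask),
    sampled_negative_indices=torch.from_numpy(negatives).long(),
)
outputs.loss.backward()
optimizer.step()
```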
Related papers
- Granary: Speech Recognition and Translation Dataset in 25 European Languages [37.561934855489504]
Granary is a large-scale collection of speech datasets for recognition and translation across 25 European languages. It is the first open-source effort at this scale for both transcription and translation.
arXiv Detail & Related papers (2025-05-19T17:40:58Z)
- Automatic Proficiency Assessment in L2 English Learners [51.652753736780205]
Second language (L2) proficiency in English is usually evaluated perceptually by English teachers or expert evaluators. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription.
arXiv Detail & Related papers (2025-05-05T12:36:03Z)
- Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners. We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- Federated Learning for ASR based on Wav2vec 2.0 [4.711492191554342]
We study the use of federated learning to train an ASR model based on a wav2vec 2.0 model pre-trained by self-supervision. Experiments show that such a model can obtain, without a language model, a word error rate of 10.92% on the official TED-LIUM 3 test set. We also analyse ASR performance for speakers depending on their participation in the federated learning.
arXiv Detail & Related papers (2023-02-20T18:36:46Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to building speech translation systems without labeled data. We present an unsupervised domain adaptation technique for pre-trained speech models. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Deploying self-supervised learning in the wild for hybrid automatic speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR). We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another. We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Large-Scale Self- and Semi-Supervised Learning for Speech Translation [48.06478781295623]
We explore both pre-training and self-training using the large Libri-Light speech audio corpus and language modeling with CommonCrawl. Our experiments improve over the previous state of the art by 2.6 BLEU on average across all four considered CoVoST 2 language pairs.
arXiv Detail & Related papers (2021-04-14T07:44:52Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
wav2vec 2.0 is a self-supervised framework for speech representation learning. In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification. For speaker verification, we obtain a new state-of-the-art Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset. For language identification, we obtain an EER of 12.02% in the 1-second condition and an EER of 3.47% in the full-length condition of the AP17-OLR dataset.
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
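As a toy illustration of this last entry's idea, the sketch below repurposes frozen wav2vec 2.0 features for speaker verification by mean-pooling hidden states and scoring utterance pairs with cosine similarity. The cited paper fine-tunes the model rather than using frozen features, so this is a simplifying assumption, as is the choice of the wav2vec2-base checkpoint.

```python
# Toy sketch: naive speaker verification with frozen wav2vec 2.0 features.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform_16khz):
    """Mean-pool the final transformer layer into one utterance vector."""
    inputs = extractor(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)

# Two dummy utterances; in practice these come from VoxCeleb1 trial pairs.
utt_a = torch.randn(16_000 * 3).numpy()
utt_b = torch.randn(16_000 * 3).numpy()
score = torch.nn.functional.cosine_similarity(embed(utt_a), embed(utt_b), dim=0)
print(f"same-speaker score: {score.item():.3f}")  # threshold at the EER point
```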
This list is automatically generated from the titles and abstracts of the papers on this site.