Voice Adaptation for Swiss German
- URL: http://arxiv.org/abs/2505.22054v1
- Date: Wed, 28 May 2025 07:24:40 GMT
- Title: Voice Adaptation for Swiss German
- Authors: Samuel Stucki, Jan Deriu, Mark Cieliebak
- Abstract summary: This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect.
- Score: 7.4162190889971
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. Our work shows a step towards adapting Voice Cloning technology to underrepresented languages. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8.
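For orientation, the sketch below shows zero-shot voice cloning with the publicly available XTTS-v2 checkpoint via the Coqui TTS Python API, the model the paper fine-tunes. This is a minimal illustration only: the fine-tuned Swiss German weights and the dialect conditioning described in the abstract are not shown, and the file paths and reference clip are placeholders.

```python
from TTS.api import TTS

# Load the public multilingual XTTS-v2 checkpoint (the paper fine-tunes this
# model on ~5000 hours of weakly labeled Swiss podcast speech; that fine-tuned
# checkpoint is not used here).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone a voice from a short reference clip and synthesize Standard German
# text. Paths are placeholders; XTTS-v2 ships no dedicated Swiss German code,
# so the Standard German language tag is used.
tts.tts_to_file(
    text="Wie sieht das Wetter morgen in Zürich aus?",
    speaker_wav="reference_speaker.wav",  # a few seconds of the target voice
    language="de",
    file_path="cloned_output.wav",
)
```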
Related papers
- Advancing STT for Low-Resource Real-World Speech [0.0]
This paper introduces the new SRB-300 dataset, a 300-hour annotated speech corpus. It captures spontaneous speech across all major Swiss dialects, recorded in various realistic environments. We fine-tuned multiple OpenAI Whisper models on the SRB-300 dataset, achieving notable enhancements over previous zero-shot performance metrics.
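As a rough illustration of the zero-shot baseline that this fine-tuning improves upon, the sketch below transcribes a Swiss German clip with an off-the-shelf Whisper checkpoint. The model size, audio path, and use of the Standard German language code are assumptions; the fine-tuning on SRB-300 itself is not reproduced.

```python
import whisper

# Off-the-shelf Whisper checkpoint as a zero-shot baseline; the SRB-300 paper
# fine-tunes such models on Swiss German speech, which is not shown here.
model = whisper.load_model("small")

# Whisper has no Swiss German language code, so decoding is steered with the
# Standard German code. The audio path is a placeholder.
result = model.transcribe("swiss_german_clip.wav", language="de")
print(result["text"])
```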
arXiv Detail & Related papers (2025-06-10T14:22:48Z) - Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect [52.1701152610258]
Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance.
For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
arXiv Detail & Related papers (2024-01-25T18:59:32Z) - SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
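A minimal sketch of direct speech-to-text translation with the public SeamlessM4T v2 checkpoint through the Hugging Face transformers API follows; the checkpoint name, audio file, and language codes are illustrative, and the snippet does not reproduce the FLEURS evaluation.

```python
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

# Public SeamlessM4T v2 checkpoint for direct speech-to-text translation.
model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2ForSpeechToText.from_pretrained(model_id)

# Load a placeholder German utterance and resample to the expected 16 kHz.
waveform, sample_rate = torchaudio.load("german_utterance.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Translate the speech directly into English text.
inputs = processor(audios=waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
output_tokens = model.generate(**inputs, tgt_lang="eng")
print(processor.decode(output_tokens[0].tolist(), skip_special_tokens=True))
```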
arXiv Detail & Related papers (2023-08-22T17:44:18Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions [5.6787416472329495]
We present STT4SG-350 (Speech-to-Text for Swiss German), a corpus of Swiss German speech annotated with Standard German text at the sentence level.
The data is collected using a web app in which the speakers are shown Standard German sentences, which they translate to Swiss German and record.
It contains 343 hours of speech from all dialect regions and is the largest public speech corpus for Swiss German to date.
arXiv Detail & Related papers (2023-05-30T08:49:38Z) - Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
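The sketch below runs ASR with the released MMS checkpoint and its per-language adapters via Hugging Face transformers; the German language code and audio path are placeholders, and the language identification and speech synthesis components mentioned above are not covered.

```python
import torch
import torchaudio
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Released multilingual MMS ASR checkpoint with per-language adapter weights.
model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Switch both the tokenizer vocabulary and the adapter weights to German.
processor.tokenizer.set_target_lang("deu")
model.load_adapter("deu")

# Load a placeholder utterance and resample to the expected 16 kHz.
waveform, sample_rate = torchaudio.load("utterance.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(processor.decode(torch.argmax(logits, dim=-1)[0]))
```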
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
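Read this way, the augmentation amounts to splicing monolingual utterances into artificial code-switched examples. The sketch below is a generic illustration under that assumption, not the authors' exact pipeline; details such as silence padding or language tags are omitted.

```python
import numpy as np

def concat_code_switch(utterances):
    """Concatenate (waveform, transcript) pairs from different source
    languages into one artificial code-switched training example.
    Waveforms are assumed to share a single sampling rate."""
    waveforms, transcripts = zip(*utterances)
    return np.concatenate(waveforms), " ".join(transcripts)

# Placeholder example: splice a German and an English utterance.
german = (np.random.randn(16_000).astype(np.float32), "guten morgen zusammen")
english = (np.random.randn(16_000).astype(np.float32), "good morning everyone")
audio, text = concat_code_switch([german, english])
```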
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - SDS-200: A Swiss German Speech to Standard German Text Corpus [5.370317759946287]
We present SDS-200, a corpus of Swiss German dialectal speech with Standard German text translations.
The data was collected using a web recording tool that is open to the public.
The data consists of 200 hours of speech by around 4000 different speakers and covers a large part of the Swiss-German dialect landscape.
arXiv Detail & Related papers (2022-05-19T12:16:29Z) - Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021 [17.675379299410054]
Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland.
We propose a hybrid automatic speech recognition system with a lexicon that incorporates translations.
Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second best competitor by a 12% relative margin.
arXiv Detail & Related papers (2021-06-15T13:34:02Z) - SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German [22.30271453485001]
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German.
arXiv Detail & Related papers (2021-03-21T14:00:09Z) - Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)