SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German
- URL: http://arxiv.org/abs/2103.11401v1
- Date: Sun, 21 Mar 2021 14:00:09 GMT
- Title: SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German
- Authors: Pelin Dogan-Schönberger, Julian Mäder, Thomas Hofmann
- Abstract summary: We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German.
- Score: 22.30271453485001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Swiss German is a dialect continuum whose natively acquired dialects
significantly differ from the formal variety of the language. These dialects
are mostly used for verbal communication and do not have standard orthography.
This has led to a lack of annotated datasets, rendering the use of many NLP
methods infeasible. In this paper, we introduce the first annotated parallel
corpus of spoken Swiss German across 8 major dialects, plus a Standard German
reference. Our goal has been to create and to make available a basic dataset
for employing data-driven NLP applications in Swiss German. We present our data
collection procedure in detail and validate the quality of our corpus by
conducting experiments with the recent neural models for speech synthesis.
Related papers
- A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation [19.535404632372042]
Betthupferl is an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several multilingual state-of-the-art ASR models on speech translation into Standard German, and find differences between how much the output resembles the dialectal vs. standardized transcriptions.
arXiv Detail & Related papers (2025-06-03T14:02:52Z)
- Voice Adaptation for Swiss German [7.4162190889971]
This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect.
arXiv Detail & Related papers (2025-05-28T07:24:40Z)
- Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
- Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect [52.1701152610258]
Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance.
For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
arXiv Detail & Related papers (2024-01-25T18:59:32Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography [40.04973667048665]
We evaluate how robust metrics are to non-standardized dialects.
We collect a dataset of human translations and human judgments for automatic machine translations from English to two Swiss German dialects.
arXiv Detail & Related papers (2023-11-28T15:12:11Z)
- Dialect Transfer for Swiss German Speech Translation [9.373232685350844]
This paper investigates the challenges in building Swiss German speech translation systems.
It focuses on the impact of dialect diversity and differences between Swiss German and Standard German.
arXiv Detail & Related papers (2023-10-13T13:16:57Z)
- SwissBERT: The Multilingual Language Model for Switzerland [52.1701152610258]
SwissBERT is a masked language model created specifically for processing Switzerland-related text.
SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland.
Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work.
arXiv Detail & Related papers (2023-03-23T14:44:47Z)
- A Swiss German Dictionary: Variation in Speech and Writing [45.82374977939355]
We introduce a dictionary containing forms of common words in various Swiss German dialects normalized into High German.
To alleviate the uncertainty associated with this diversity, we complement the pairs of Swiss German - High German words with the Swiss German phonetic transcriptions (SAMPA).
This dictionary thus becomes the first resource to combine large-scale spontaneous translation with phonetic transcriptions.
arXiv Detail & Related papers (2020-03-31T22:10:43Z)