Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect
- URL: http://arxiv.org/abs/2401.14400v1
- Date: Thu, 25 Jan 2024 18:59:32 GMT
- Title: Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect
- Authors: Jannis Vamvas, Noëmi Aepli, Rico Sennrich
- Abstract summary: Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance.
For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creating neural text encoders for written Swiss German is challenging due to
a dearth of training data combined with dialectal variation. In this paper, we
build on several existing multilingual encoders and adapt them to Swiss German
using continued pre-training. Evaluation on three diverse downstream tasks
shows that simply adding a Swiss German adapter to a modular encoder achieves
97.5% of fully monolithic adaptation performance. We further find that for the
task of retrieving Swiss German sentences given Standard German queries,
adapting a character-level model is more effective than the other adaptation
strategies. We release our code and the models trained for our experiments at
https://github.com/ZurichNLP/swiss-german-text-encoders
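The modular adaptation described in the abstract rests on the bottleneck-adapter idea: a small trainable layer is inserted into a frozen encoder, so only the adapter weights are updated during continued pre-training. The sketch below is a minimal NumPy illustration of that mechanism under generic assumptions (dimensions, initialization, and the GELU approximation are illustrative), not the paper's actual implementation, which builds on existing multilingual encoders.

```python
import numpy as np

def bottleneck_adapter(hidden, w_down, w_up):
    """Minimal bottleneck adapter: down-project, apply a GELU-style
    nonlinearity, up-project, and add a residual connection. During
    modular adaptation, only these adapter weights would be trained
    while the encoder itself stays frozen."""
    z = hidden @ w_down  # (seq, d_model) -> (seq, d_bottleneck)
    # tanh approximation of GELU
    z = 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
    return hidden + z @ w_up  # residual keeps the original representation

rng = np.random.default_rng(0)
d_model, d_bottleneck, seq_len = 8, 2, 4
hidden = rng.standard_normal((seq_len, d_model))
w_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02
# Zero-initializing the up-projection makes the adapter start as an
# identity function, a common choice so training begins from the
# unmodified pre-trained encoder.
w_up = np.zeros((d_bottleneck, d_model))

out = bottleneck_adapter(hidden, w_down, w_up)
```

Because the up-projection starts at zero, the adapter output equals its input at initialization; training then lets the dialect-specific adapter gradually deviate from the frozen multilingual representation.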
Related papers
- Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents (arXiv, 2024-05-13)
  We present a version of the SwissBERT encoder model that we fine-tuned specifically for embedding sentences and documents.
  SwissBERT contains language adapters for the four national languages of Switzerland.
  Experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model.
- A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography (arXiv, 2023-11-28)
  We evaluate how robust metrics are to non-standardized dialects.
  We collect a dataset of human translations and human judgments for automatic machine translations from English to two Swiss German dialects.
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding (arXiv, 2023-05-16)
  We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
  We introduce a novel representation translation learning (RTL) task, where the model learns to use one-sided contextualized token representations to reconstruct their translation counterparts.
  Our approach significantly improves sentence embedding quality.
- SwissBERT: The Multilingual Language Model for Switzerland (arXiv, 2023-03-23)
  SwissBERT is a masked language model created specifically for processing Switzerland-related text.
  It is a pre-trained model adapted to news articles written in the national languages of Switzerland.
  Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work.
- Multilingual Unsupervised Neural Machine Translation with Denoising Adapters (arXiv, 2021-10-20)
  We consider the problem of multilingual unsupervised machine translation: translating to and from languages that have only monolingual data.
  The standard procedure for leveraging monolingual data is back-translation, which is computationally costly and hard to tune.
  Instead, we propose denoising adapters, adapter layers with a denoising objective, on top of pre-trained mBART-50.
- Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021 (arXiv, 2021-06-15)
  Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland.
  We propose a hybrid automatic speech recognition system with a lexicon that incorporates translations.
  Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second-best competitor by a 12% relative margin.
- SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German (arXiv, 2021-03-21)
  We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
  Our goal has been to create and make available a basic dataset for data-driven NLP applications in Swiss German.
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings (arXiv, 2021-03-11)
  We exploit a language-independent multilingual sentence representation to generalize easily to a new language.
  Blindly decoding from Portuguese with a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
  We explore a more practical adaptation approach through non-iterative back-translation, exploiting our model's ability to produce high-quality translations.
- A Swiss German Dictionary: Variation in Speech and Writing (arXiv, 2020-03-31)
  We introduce a dictionary containing forms of common words in various Swiss German dialects normalized into High German.
  To alleviate the uncertainty associated with this diversity, we complement the Swiss German - High German word pairs with Swiss German phonetic transcriptions (SAMPA).
  This dictionary thus becomes the first resource to combine large-scale spontaneous translation with phonetic transcriptions.