SwissBERT: The Multilingual Language Model for Switzerland
- URL: http://arxiv.org/abs/2303.13310v3
- Date: Tue, 16 Jan 2024 16:24:36 GMT
- Title: SwissBERT: The Multilingual Language Model for Switzerland
- Authors: Jannis Vamvas, Johannes Graën, and Rico Sennrich
- Abstract summary: SwissBERT is a masked language model created specifically for processing Switzerland-related text.
SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland.
Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work.
- Score: 52.1701152610258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SwissBERT, a masked language model created specifically for
processing Switzerland-related text. SwissBERT is a pre-trained model that we
adapted to news articles written in the national languages of Switzerland --
German, French, Italian, and Romansh. We evaluate SwissBERT on natural language
understanding tasks related to Switzerland and find that it tends to outperform
previous models on these tasks, especially when processing contemporary news
and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be
extended to Swiss German dialects in future work. The model and our open-source
code are publicly released at https://github.com/ZurichNLP/swissbert.
Related papers
- Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
- Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents [10.819408603463428]
We present a version of the SwissBERT encoder model that we specifically fine-tuned for embedding sentences and documents.
SwissBERT contains language adapters for the four national languages of Switzerland.
Experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model.
arXiv Detail & Related papers (2024-05-13T07:20:21Z)
- Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect [52.1701152610258]
Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance.
For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
arXiv Detail & Related papers (2024-01-25T18:59:32Z)
- ML-SUPERB: Multilingual Speech Universal PERformance Benchmark [73.65853301350042]
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.
This paper presents multilingual SUPERB, covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification.
Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features.
arXiv Detail & Related papers (2023-05-18T00:01:27Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- 2nd Swiss German Speech to Standard German Text Shared Task at SwissText 2022 [3.910747992453137]
The objective was to maximize the BLEU score on a test set of Grisons speech.
Three teams participated, with the best-performing system achieving a BLEU score of 70.1.
arXiv Detail & Related papers (2023-01-17T10:31:11Z)
- Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021 [17.675379299410054]
Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland.
We propose a hybrid automatic speech recognition system with a lexicon that incorporates translations.
Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second best competitor by a 12% relative margin.
arXiv Detail & Related papers (2021-06-15T13:34:02Z)
- SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German [22.30271453485001]
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German.
arXiv Detail & Related papers (2021-03-21T14:00:09Z)
- A Swiss German Dictionary: Variation in Speech and Writing [45.82374977939355]
We introduce a dictionary containing forms of common words in various Swiss German dialects normalized into High German.
To alleviate the uncertainty associated with this diversity, we complement the pairs of Swiss German - High German words with the Swiss German phonetic transcriptions (SAMPA).
This dictionary thus becomes the first resource to combine large-scale spontaneous translation with phonetic transcriptions.
arXiv Detail & Related papers (2020-03-31T22:10:43Z)