Related papers: SwissBERT: The Multilingual Language Model for Switzerland

URL: http://arxiv.org/abs/2303.13310v3
Date: Tue, 16 Jan 2024 16:24:36 GMT
Title: SwissBERT: The Multilingual Language Model for Switzerland
Authors: Jannis Vamvas and Johannes Graën and Rico Sennrich
Abstract summary: SwissBERT is a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work.
Score: 52.1701152610258
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at https://github.com/ZurichNLP/swissbert.

Related papers

SwissGPC v1.0 -- The Swiss German Podcasts Corpus [7.0900880654100895]
We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which contain approximately 5400 hours of raw audio. After segmentation and weak annotation, nearly 5000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German.
arXiv Detail & Related papers (2025-09-24T08:13:44Z)
Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents [10.819408603463428]
We present a version of the SwissBERT encoder model that we specifically fine-tuned for embedding sentences and documents. SwissBERT contains language adapters for the four national languages of Switzerland. Experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model.
arXiv Detail & Related papers (2024-05-13T07:20:21Z)
Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect [52.1701152610258]
Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance. For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
arXiv Detail & Related papers (2024-01-25T18:59:32Z)
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark [73.65853301350042]
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. This paper presents multilingual SUPERB, covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features.
arXiv Detail & Related papers (2023-05-18T00:01:27Z)
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages. We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
2nd Swiss German Speech to Standard German Text Shared Task at SwissText 2022 [3.910747992453137]
The objective was to maximize the BLEU score on a test set of Grisons speech. 3 teams participated, with the best-performing system achieving a BLEU score of 70.1.
arXiv Detail & Related papers (2023-01-17T10:31:11Z)
Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021 [17.675379299410054]
Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland. We propose a hybrid automatic speech recognition system with a lexicon that incorporates translations. Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second best competitor by a 12% relative margin.
arXiv Detail & Related papers (2021-06-15T13:34:02Z)
SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German [22.30271453485001]
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference. Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German.
arXiv Detail & Related papers (2021-03-21T14:00:09Z)
A Swiss German Dictionary: Variation in Speech and Writing [45.82374977939355]
We introduce a dictionary containing forms of common words in various Swiss German dialects normalized into High German. To alleviate the uncertainty associated with this diversity, we complement the pairs of Swiss German - High German words with the Swiss German phonetic transcriptions (SAMPA). This dictionary thus becomes the first resource to combine large-scale spontaneous translation with phonetic transcriptions.
arXiv Detail & Related papers (2020-03-31T22:10:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.