Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents
- URL: http://arxiv.org/abs/2405.07513v1
- Date: Mon, 13 May 2024 07:20:21 GMT
- Title: Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents
- Authors: Juri Grosjean, Jannis Vamvas
- Abstract summary: We present a version of the SwissBERT encoder model that we specifically fine-tuned for embedding sentences and documents.
SwissBERT contains language adapters for the four national languages of Switzerland.
Experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model.
- Score: 10.819408603463428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoder models trained for the embedding of sentences or short documents have proven useful for tasks such as semantic search and topic modeling. In this paper, we present a version of the SwissBERT encoder model that we specifically fine-tuned for this purpose. SwissBERT contains language adapters for the four national languages of Switzerland -- German, French, Italian, and Romansh -- and has been pre-trained on a large number of news articles in those languages. Using contrastive learning based on a subset of these articles, we trained a fine-tuned version, which we call SentenceSwissBERT. Multilingual experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model and of a comparable baseline. The model is openly available for research use.
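As a rough illustration of how such an embedding model can be used for semantic search, the sketch below mean-pools SwissBERT's token representations after selecting a language adapter. This is not the authors' released code: the checkpoint identifier, the adapter codes (e.g. de_CH), and mean pooling as the pooling strategy are assumptions based on the X-MOD architecture that SwissBERT builds on.

```python
# Minimal sketch: sentence embeddings from an X-MOD-style SwissBERT checkpoint.
# Assumptions: the Hub identifier below, the de_CH adapter code, and mean pooling.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "ZurichNLP/swissbert"  # assumed identifier; swap in the fine-tuned checkpoint if available

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.set_default_language("de_CH")  # pick the language adapter matching the input language
model.eval()

def embed(sentences):
    """Mean-pool the final hidden states over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # (batch, dim)

# Example: cosine similarity between two German sentences
emb = embed(["Der Bundesrat hat heute entschieden.",
             "Die Regierung fällte heute einen Entscheid."])
print(float(torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)))
```

Because the language adapters are switched per input language, a single model of this kind can embed German, French, Italian, and Romansh text; per the abstract, the fine-tuning itself was done with a contrastive objective on Swiss news articles.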
Related papers
- Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect [52.1701152610258]
Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance.
For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
arXiv Detail & Related papers (2024-01-25T18:59:32Z)
- Text-to-Speech Pipeline for Swiss German -- A comparison [2.7787719874237986]
We studied the synthesis of Swiss German speech using different Text-to-Speech (TTS) models.
We found that VITS models performed best and hence used them for further testing.
arXiv Detail & Related papers (2023-05-31T11:33:18Z)
- SwissBERT: The Multilingual Language Model for Switzerland [52.1701152610258]
SwissBERT is a masked language model created specifically for processing Switzerland-related text.
SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland.
Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work.
arXiv Detail & Related papers (2023-03-23T14:44:47Z)
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Scribosermo: Fast Speech-to-Text models for German and other Languages [69.7571480246023]
This paper presents Speech-to-Text models for German, as well as for Spanish and French with special features.
They are small and run in real time on single-board computers like a Raspberry Pi.
Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset.
arXiv Detail & Related papers (2021-10-15T10:10:34Z)
- SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German [22.30271453485001]
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and make available a basic dataset that enables data-driven NLP applications in Swiss German.
arXiv Detail & Related papers (2021-03-21T14:00:09Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- Playing with Words at the National Library of Sweden -- Making a Swedish BERT [0.0]
This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLab for data-driven research at the National Library of Sweden (KB).
Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish.
arXiv Detail & Related papers (2020-07-03T12:53:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.