Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents
- URL: http://arxiv.org/abs/2405.07513v1
- Date: Mon, 13 May 2024 07:20:21 GMT
- Title: Fine-tuning the SwissBERT Encoder Model for Embedding Sentences and Documents
- Authors: Juri Grosjean, Jannis Vamvas
- Abstract summary: We present a version of the SwissBERT encoder model that we specifically fine-tuned for embedding sentences and documents.
SwissBERT contains language adapters for the four national languages of Switzerland.
Experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model.
- Score: 10.819408603463428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoder models trained for the embedding of sentences or short documents have proven useful for tasks such as semantic search and topic modeling. In this paper, we present a version of the SwissBERT encoder model that we specifically fine-tuned for this purpose. SwissBERT contains language adapters for the four national languages of Switzerland -- German, French, Italian, and Romansh -- and has been pre-trained on a large number of news articles in those languages. Using contrastive learning based on a subset of these articles, we trained a fine-tuned version, which we call SentenceSwissBERT. Multilingual experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model and of a comparable baseline. The model is openly available for research use.
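As a rough illustration of how such an embedding model can be used for semantic search, the sketch below mean-pools SwissBERT's token representations after selecting a language adapter. This is not the authors' released code: the checkpoint identifier, the adapter codes (e.g. de_CH), and mean pooling as the pooling strategy are assumptions based on the X-MOD architecture that SwissBERT builds on.

```python
# Minimal sketch: sentence embeddings from an X-MOD-style SwissBERT checkpoint.
# Assumptions: the Hub identifier below, the de_CH adapter code, and mean pooling.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "ZurichNLP/swissbert"  # assumed identifier; swap in the fine-tuned checkpoint if available

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.set_default_language("de_CH")  # pick the language adapter matching the input language
model.eval()

def embed(sentences):
    """Mean-pool the final hidden states over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # (batch, dim)

# Example: cosine similarity between two German sentences
emb = embed(["Der Bundesrat hat heute entschieden.",
             "Die Regierung fällte heute einen Entscheid."])
print(float(torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)))
```

Because the language adapters are switched per input language, a single model of this kind can embed German, French, Italian, and Romansh text; per the abstract, the fine-tuning itself was done with a contrastive objective on Swiss news articles.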
Related papers
- Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect [52.1701152610258]
Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance.
For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
arXiv Detail & Related papers (2024-01-25T18:59:32Z)
- Text-to-Speech Pipeline for Swiss German -- A comparison [2.7787719874237986]
We studied the synthesis of Swiss German speech using different Text-to-Speech (TTS) models.
We found that VITS models performed best and hence used them for further testing.
arXiv Detail & Related papers (2023-05-31T11:33:18Z)
- SwissBERT: The Multilingual Language Model for Switzerland [52.1701152610258]
SwissBERT is a masked language model created specifically for processing Switzerland-related text.
SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland.
Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work.
arXiv Detail & Related papers (2023-03-23T14:44:47Z)
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Scribosermo: Fast Speech-to-Text models for German and other Languages [69.7571480246023]
This paper presents Speech-to-Text models for German, as well as for Spanish and French with special features.
They are small and run in real time on single-board computers like a Raspberry Pi.
Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset.
arXiv Detail & Related papers (2021-10-15T10:10:34Z)
- SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German [22.30271453485001]
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and make available a basic dataset that enables data-driven NLP applications in Swiss German.
arXiv Detail & Related papers (2021-03-21T14:00:09Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- Playing with Words at the National Library of Sweden -- Making a Swedish BERT [0.0]
This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLab for data-driven research at the National Library of Sweden (KB).
Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish.
arXiv Detail & Related papers (2020-07-03T12:53:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.