Adapting Multilingual Embedding Models to Historical Luxembourgish
- URL: http://arxiv.org/abs/2502.07938v3
- Date: Thu, 13 Mar 2025 13:19:30 GMT
- Title: Adapting Multilingual Embedding Models to Historical Luxembourgish
- Authors: Andrianos Michail, Corina Julia Raclé, Juri Opitz, Simon Clematide,
- Abstract summary: This study examines multilingual embeddings for cross-lingual semantic search in historical Luxembourgish.<n>We use GPT-4o for sentence segmentation and translation, generating 20,000 parallel training sentences per language pair.<n>We adapt several multilingual embedding models through contrastive learning or knowledge distillation and increase accuracy significantly for all models.
- Score: 5.474797258314828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models face challenges with historical content due to OCR noise and outdated spellings. This study examines multilingual embeddings for cross-lingual semantic search in historical Luxembourgish (LB), a low-resource language. We collect historical Luxembourgish news articles from various periods and use GPT-4o for sentence segmentation and translation, generating 20,000 parallel training sentences per language pair. Additionally, we create a semantic search (Historical LB Bitext Mining) evaluation set and find that existing models perform poorly on cross-lingual search for historical Luxembourgish. Using our historical and additional modern parallel training data, we adapt several multilingual embedding models through contrastive learning or knowledge distillation and increase accuracy significantly for all models. We release our adapted models and historical Luxembourgish-German/French/English bitexts to support further research.
Related papers
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models.
We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance.
We see strong cross-lingual performance with English-based retrievers that trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z) - Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy [7.59001382786429]
This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish.<n>We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts of German and French data.<n>For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.
arXiv Detail & Related papers (2024-12-12T16:23:12Z) - A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives [2.3592914313389257]
We are comparing monolingual Wav2Vec 2.0 models with various multilingual models to see whether we could improve speech recognition performance.
Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models.
arXiv Detail & Related papers (2024-07-24T11:03:47Z) - A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z) - MENTOR: Multilingual tExt detectioN TOward leaRning by analogy [59.37382045577384]
We propose a framework to detect and identify both seen and unseen language regions inside scene images.
"MENTOR" is the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
arXiv Detail & Related papers (2024-03-12T03:35:17Z) - Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge
Distillation [0.0]
We use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
We build a parallel sentence dataset using a sentence-embedding alignment method to align Ancient Greek documents with English translations.
We evaluate our models on translation search, semantic similarity, and semantic retrieval tasks.
arXiv Detail & Related papers (2023-08-24T23:38:44Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - hmBERT: Historical Multilingual Language Models for Named Entity
Recognition [0.6226609932118123]
We tackle NER for identifying persons, locations, and organizations in historical texts.
In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models.
arXiv Detail & Related papers (2022-05-31T07:30:33Z) - Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16$textth$ to the 18$textth$ centuries).
We present the $textFreEM_textmax$ corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on $textFreEM_textmax$.
arXiv Detail & Related papers (2022-02-18T22:17:22Z) - Summarising Historical Text in Modern Languages [13.886432536330805]
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This is a fundamentally important routine to historians and digital humanities researchers but has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
arXiv Detail & Related papers (2021-01-26T13:00:07Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.