Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
- URL: http://arxiv.org/abs/2308.13116v1
- Date: Thu, 24 Aug 2023 23:38:44 GMT
- Title: Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
- Authors: Kevin Krahn, Derrick Tate, Andrew C. Lamicela
- Abstract summary: We use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
We build a parallel sentence dataset using a sentence-embedding alignment method to align Ancient Greek documents with English translations.
We evaluate our models on translation search, semantic similarity, and semantic retrieval tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual language models have been trained on Classical languages,
including Ancient Greek and Latin, for tasks such as lemmatization,
morphological tagging, part of speech tagging, authorship attribution, and
detection of scribal errors. However, high-quality sentence embedding models
for these historical languages are significantly more difficult to achieve due
to the lack of training data. In this work, we use a multilingual knowledge
distillation approach to train BERT models to produce sentence embeddings for
Ancient Greek text. The state-of-the-art sentence embedding approaches for
high-resource languages use massive datasets, but our distillation approach
allows our Ancient Greek models to inherit the properties of these models while
using a relatively small amount of translated sentence data. We build a
parallel sentence dataset using a sentence-embedding alignment method to align
Ancient Greek documents with English translations, and use this dataset to
train our models. We evaluate our models on translation search, semantic
similarity, and semantic retrieval tasks and investigate translation bias. We
make our training and evaluation datasets freely available at
https://github.com/kevinkrahn/ancient-greek-datasets .
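For illustration, here is a minimal sketch of the multilingual knowledge distillation setup the abstract describes, written with the sentence-transformers library. The teacher checkpoint, the Ancient Greek student checkpoint, and the TSV filename are placeholders, not the paper's exact configuration:

```python
# Sketch of multilingual knowledge distillation (Reimers & Gurevych, 2020):
# the student learns to map Ancient Greek translations onto the teacher's
# English sentence embeddings via an MSE objective.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: any strong English sentence embedding model (placeholder choice).
teacher = SentenceTransformer("all-mpnet-base-v2")

# Student: an Ancient Greek BERT with mean pooling (checkpoint name is illustrative).
word_emb = models.Transformer("pranaydeeps/Ancient-Greek-BERT", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling])

# Parallel data: one tab-separated pair per line, "English<TAB>Ancient Greek".
data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
data.load_data("parallel_en_grc.tsv")
loader = DataLoader(data, shuffle=True, batch_size=32)

# MSE between student embeddings and the teacher's embedding of the English side.
student.fit(train_objectives=[(loader, losses.MSELoss(model=student))],
            epochs=5, warmup_steps=1000)
```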
Related papers
- Exploring Large Language Models for Classical Philology [17.856304057963776]
We create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages.
We evaluate all models on morphological and syntactic tasks, including lemmatization.
Results show that our models provide significant improvements over the SoTA.
arXiv Detail & Related papers (2023-05-23T05:21:02Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
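The abstract does not spell out the generative model itself, but the bitext mining evaluation it mentions can be illustrated generically: embed both sides of a candidate corpus and match sentences by similarity. The encoder choice and sentences below are placeholders, not the paper's model:

```python
from sentence_transformers import SentenceTransformer, util

# Any multilingual sentence encoder stands in for the paper's model here.
model = SentenceTransformer("sentence-transformers/LaBSE")

src = ["Der Hund schläft.", "Ich lese ein Buch."]       # language A
tgt = ["I am reading a book.", "The dog is sleeping."]  # language B candidates

emb_src = model.encode(src, convert_to_tensor=True, normalize_embeddings=True)
emb_tgt = model.encode(tgt, convert_to_tensor=True, normalize_embeddings=True)

# For each source sentence, pick the most similar target as its translation.
sims = util.cos_sim(emb_src, emb_tgt)
for i, j in enumerate(sims.argmax(dim=1).tolist()):
    print(src[i], "->", tgt[j])
```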
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
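A rough sketch of this recipe with sentence-transformers: treat mined paraphrase pairs as positives and train with in-batch negatives. The base checkpoint and example pairs are placeholders; the mining step is assumed to have already produced the pairs:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Paraphrase pairs, e.g. two source sentences aligned to the same translation
# in a bilingual corpus (toy examples; real pairs come from the mining step).
pairs = [
    InputExample(texts=["A man is playing a guitar.", "Someone plays the guitar."]),
    InputExample(texts=["The weather is nice today.", "It is a lovely day."]),
]

model = SentenceTransformer("distilroberta-base")  # placeholder base encoder
loader = DataLoader(pairs, shuffle=True, batch_size=2)

# In-batch negatives: each pair is a positive, other pairs in the batch are negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```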
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach [8.00388161728995]
We present models which complete missing text given transliterations of ancient Mesopotamian documents.
Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts in the text.
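The mechanism is standard masked language modelling; a minimal sketch with the Hugging Face fill-mask pipeline, using a generic multilingual checkpoint as a stand-in for a model trained on transliterated Akkadian:

```python
from transformers import pipeline

# Stand-in checkpoint; the paper trains on transliterated Mesopotamian documents.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Rank candidate completions for a damaged span in context.
for cand in fill("The scribe wrote the [MASK] on the clay tablet."):
    print(cand["token_str"], round(cand["score"], 3))
```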
arXiv Detail & Related papers (2021-09-09T18:58:14Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
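One common recipe for the unseen-script problem (not necessarily the exact method this paper proposes) is to extend the tokenizer vocabulary and resize the embedding matrix before continued pretraining; a sketch with Hugging Face transformers:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Add vocabulary items for a script unseen during pretraining
# (Phoenician letters here, purely as an example).
num_added = tok.add_tokens(["𐤀", "𐤁", "𐤂"])
model.resize_token_embeddings(len(tok))

# The new embedding rows start randomly initialised; continuing masked
# language modelling on target-language text is what makes them useful.
print(f"added {num_added} tokens; embedding matrix now has {len(tok)} rows")
```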
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
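A heavily simplified PyTorch sketch of the idea: a shared encoder feeds two prediction heads, one reconstructing the input and one predicting the translation, and the encoder states double as contextualised word embeddings. The paper's decoders would be proper autoregressive LSTM decoders; per-position heads are used here only to keep the sketch short:

```python
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.emb = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.recon = nn.Linear(2 * dim, src_vocab)  # reconstruction objective
        self.trans = nn.Linear(2 * dim, tgt_vocab)  # translation objective

    def forward(self, src_ids):
        states, _ = self.encoder(self.emb(src_ids))  # (batch, len, 2*dim)
        return states, self.recon(states), self.trans(states)

model = TranslateAndReconstruct(src_vocab=1000, tgt_vocab=1200)
ids = torch.randint(0, 1000, (2, 7))                 # toy batch of token ids
embeddings, recon_logits, trans_logits = model(ids)
# After training on both objectives, `embeddings` serve as contextualised
# cross-lingual word vectors.
```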
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
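As a toy illustration of the character-language-model side (the exact models compared in the paper are not specified here), a smoothed character trigram model can rank candidate normalizations even when trained on only a handful of words:

```python
import math
from collections import Counter

def train_char_lm(words, n=3):
    """Count character n-grams (with boundary markers) from a tiny corpus."""
    grams, contexts = Counter(), Counter()
    for word in words:
        w = f"^{word}$"
        for i in range(len(w) - n + 1):
            grams[w[i:i + n]] += 1
            contexts[w[i:i + n - 1]] += 1
    return grams, contexts

def score(word, grams, contexts, n=3, alphabet=30):
    """Add-one smoothed log-probability of a candidate spelling."""
    w = f"^{word}$"
    return sum(math.log((grams[w[i:i + n]] + 1) /
                        (contexts[w[i:i + n - 1]] + alphabet))
               for i in range(len(w) - n + 1))

grams, contexts = train_char_lm(["lugal", "nagar", "naga"])  # toy training words
print(max(["lugal", "lugga"], key=lambda c: score(c, grams, contexts)))
```

In an interactive setting, each confirmed correction would be appended to the training words and the counts refreshed, which matches the improve-as-data-arrives scenario the summary describes.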
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.