Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
- URL: http://arxiv.org/abs/2308.13116v1
- Date: Thu, 24 Aug 2023 23:38:44 GMT
- Title: Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
- Authors: Kevin Krahn, Derrick Tate, Andrew C. Lamicela
- Abstract summary: We use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
We build a parallel sentence dataset using a sentence-embedding alignment method to align Ancient Greek documents with English translations.
We evaluate our models on translation search, semantic similarity, and semantic retrieval tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual language models have been trained on Classical languages,
including Ancient Greek and Latin, for tasks such as lemmatization,
morphological tagging, part of speech tagging, authorship attribution, and
detection of scribal errors. However, high-quality sentence embedding models
for these historical languages are significantly more difficult to achieve due
to the lack of training data. In this work, we use a multilingual knowledge
distillation approach to train BERT models to produce sentence embeddings for
Ancient Greek text. The state-of-the-art sentence embedding approaches for
high-resource languages use massive datasets, but our distillation approach
allows our Ancient Greek models to inherit the properties of these models while
using a relatively small amount of translated sentence data. We build a
parallel sentence dataset using a sentence-embedding alignment method to align
Ancient Greek documents with English translations, and use this dataset to
train our models. We evaluate our models on translation search, semantic
similarity, and semantic retrieval tasks and investigate translation bias. We
make our training and evaluation datasets freely available at
https://github.com/kevinkrahn/ancient-greek-datasets .
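For illustration, here is a minimal sketch of the multilingual knowledge distillation setup the abstract describes, written with the sentence-transformers library. The teacher checkpoint, the Ancient Greek student checkpoint, and the TSV filename are placeholders, not the paper's exact configuration:

```python
# Sketch of multilingual knowledge distillation (Reimers & Gurevych, 2020):
# the student learns to map Ancient Greek translations onto the teacher's
# English sentence embeddings via an MSE objective.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: any strong English sentence embedding model (placeholder choice).
teacher = SentenceTransformer("all-mpnet-base-v2")

# Student: an Ancient Greek BERT with mean pooling (checkpoint name is illustrative).
word_emb = models.Transformer("pranaydeeps/Ancient-Greek-BERT", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling])

# Parallel data: one tab-separated pair per line, "English<TAB>Ancient Greek".
data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
data.load_data("parallel_en_grc.tsv")
loader = DataLoader(data, shuffle=True, batch_size=32)

# MSE between student embeddings and the teacher's embedding of the English side.
student.fit(train_objectives=[(loader, losses.MSELoss(model=student))],
            epochs=5, warmup_steps=1000)
```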
Related papers
- Exploring Large Language Models for Classical Philology [17.856304057963776]
We create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages.
We evaluate all models on morphological and syntactic tasks, including lemmatization.
Results show that our models provide significant improvements over the SoTA.
arXiv Detail & Related papers (2023-05-23T05:21:02Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
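The abstract does not spell out the generative model itself, but the bitext mining evaluation it mentions can be illustrated generically: embed both sides of a candidate corpus and match sentences by similarity. The encoder choice and sentences below are placeholders, not the paper's model:

```python
from sentence_transformers import SentenceTransformer, util

# Any multilingual sentence encoder stands in for the paper's model here.
model = SentenceTransformer("sentence-transformers/LaBSE")

src = ["Der Hund schläft.", "Ich lese ein Buch."]       # language A
tgt = ["I am reading a book.", "The dog is sleeping."]  # language B candidates

emb_src = model.encode(src, convert_to_tensor=True, normalize_embeddings=True)
emb_tgt = model.encode(tgt, convert_to_tensor=True, normalize_embeddings=True)

# For each source sentence, pick the most similar target as its translation.
sims = util.cos_sim(emb_src, emb_tgt)
for i, j in enumerate(sims.argmax(dim=1).tolist()):
    print(src[i], "->", tgt[j])
```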
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
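A rough sketch of this recipe with sentence-transformers: treat mined paraphrase pairs as positives and train with in-batch negatives. The base checkpoint and example pairs are placeholders; the mining step is assumed to have already produced the pairs:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Paraphrase pairs, e.g. two source sentences aligned to the same translation
# in a bilingual corpus (toy examples; real pairs come from the mining step).
pairs = [
    InputExample(texts=["A man is playing a guitar.", "Someone plays the guitar."]),
    InputExample(texts=["The weather is nice today.", "It is a lovely day."]),
]

model = SentenceTransformer("distilroberta-base")  # placeholder base encoder
loader = DataLoader(pairs, shuffle=True, batch_size=2)

# In-batch negatives: each pair is a positive, other pairs in the batch are negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```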
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach [8.00388161728995]
We present models which complete missing text given transliterations of ancient Mesopotamian documents.
Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts in the text.
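The mechanism is standard masked language modelling; a minimal sketch with the Hugging Face fill-mask pipeline, using a generic multilingual checkpoint as a stand-in for a model trained on transliterated Akkadian:

```python
from transformers import pipeline

# Stand-in checkpoint; the paper trains on transliterated Mesopotamian documents.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Rank candidate completions for a damaged span in context.
for cand in fill("The scribe wrote the [MASK] on the clay tablet."):
    print(cand["token_str"], round(cand["score"], 3))
```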
arXiv Detail & Related papers (2021-09-09T18:58:14Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
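One common recipe for the unseen-script problem (not necessarily the exact method this paper proposes) is to extend the tokenizer vocabulary and resize the embedding matrix before continued pretraining; a sketch with Hugging Face transformers:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Add vocabulary items for a script unseen during pretraining
# (Phoenician letters here, purely as an example).
num_added = tok.add_tokens(["𐤀", "𐤁", "𐤂"])
model.resize_token_embeddings(len(tok))

# The new embedding rows start randomly initialised; continuing masked
# language modelling on target-language text is what makes them useful.
print(f"added {num_added} tokens; embedding matrix now has {len(tok)} rows")
```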
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
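A heavily simplified PyTorch sketch of the idea: a shared encoder feeds two prediction heads, one reconstructing the input and one predicting the translation, and the encoder states double as contextualised word embeddings. The paper's decoders would be proper autoregressive LSTM decoders; per-position heads are used here only to keep the sketch short:

```python
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.emb = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.recon = nn.Linear(2 * dim, src_vocab)  # reconstruction objective
        self.trans = nn.Linear(2 * dim, tgt_vocab)  # translation objective

    def forward(self, src_ids):
        states, _ = self.encoder(self.emb(src_ids))  # (batch, len, 2*dim)
        return states, self.recon(states), self.trans(states)

model = TranslateAndReconstruct(src_vocab=1000, tgt_vocab=1200)
ids = torch.randint(0, 1000, (2, 7))                 # toy batch of token ids
embeddings, recon_logits, trans_logits = model(ids)
# After training on both objectives, `embeddings` serve as contextualised
# cross-lingual word vectors.
```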
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
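As a toy illustration of the character-language-model side (the exact models compared in the paper are not specified here), a smoothed character trigram model can rank candidate normalizations even when trained on only a handful of words:

```python
import math
from collections import Counter

def train_char_lm(words, n=3):
    """Count character n-grams (with boundary markers) from a tiny corpus."""
    grams, contexts = Counter(), Counter()
    for word in words:
        w = f"^{word}$"
        for i in range(len(w) - n + 1):
            grams[w[i:i + n]] += 1
            contexts[w[i:i + n - 1]] += 1
    return grams, contexts

def score(word, grams, contexts, n=3, alphabet=30):
    """Add-one smoothed log-probability of a candidate spelling."""
    w = f"^{word}$"
    return sum(math.log((grams[w[i:i + n]] + 1) /
                        (contexts[w[i:i + n - 1]] + alphabet))
               for i in range(len(w) - n + 1))

grams, contexts = train_char_lm(["lugal", "nagar", "naga"])  # toy training words
print(max(["lugal", "lugga"], key=lambda c: score(c, grams, contexts)))
```

In an interactive setting, each confirmed correction would be appended to the training words and the counts refreshed, which matches the improve-as-data-arrives scenario the summary describes.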
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.