A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek
- URL: http://arxiv.org/abs/2410.12055v1
- Date: Tue, 15 Oct 2024 20:49:48 GMT
- Title: A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek
- Authors: Giuseppe G. A. Celano
- Abstract summary: This paper presents an experiment comparing six models to identify a state-of-the-art morphosyntactic parser and lemmatizer for Ancient Greek texts.
A normalized version of the major collections of annotated texts was used to train the baseline model Dithrax with randomly initialized character embeddings.
A Bayesian analysis shows that Dithrax and Trankit annotate morphology practically equivalently, while syntax is best annotated by Trankit and lemmata by GreTa.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents an experiment comparing six models to identify a state-of-the-art morphosyntactic parser and lemmatizer for Ancient Greek capable of annotating according to the Ancient Greek Dependency Treebank annotation scheme. A normalized version of the major collections of annotated texts was used to (i) train the baseline model Dithrax with randomly initialized character embeddings and (ii) fine-tune Trankit and four recent models pretrained on Ancient Greek texts, i.e., GreBERTa and PhilBERTa for morphosyntactic annotation and GreTa and PhilTa for lemmatization. A Bayesian analysis shows that Dithrax and Trankit annotate morphology practically equivalently, while syntax is best annotated by Trankit and lemmata by GreTa. The results of the experiment suggest that token embeddings are not sufficient to achieve high UAS and LAS scores unless they are coupled with a modeling strategy specifically designed to capture syntactic relationships. The dataset and best-performing models are made available online for reuse.
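The abstract frames the comparison in terms of UAS/LAS and a Bayesian analysis of practical equivalence. As a rough illustration only (neither the paper's code nor its exact Bayesian procedure is reproduced here), the sketch below shows how attachment scores are conventionally computed from gold and predicted dependency analyses, and how a simple Beta-Binomial posterior could estimate the probability that one parser is more accurate than another; the function names, data layout, dependency labels, and uniform priors are all assumptions.

```python
# Minimal sketch: UAS/LAS computation and an illustrative Bayesian comparison.
# Assumes each sentence is a list of (head_index, deprel) pairs, token-aligned
# between gold and predicted analyses; data and labels below are hypothetical.
import numpy as np

def uas_las(gold, pred):
    """Unlabeled and labeled attachment scores over a corpus of sentences."""
    total = correct_head = correct_both = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (g_head, g_rel), (p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_head += 1
                if g_rel == p_rel:
                    correct_both += 1
    return correct_head / total, correct_both / total

def prob_a_better(correct_a, correct_b, total, samples=100_000, seed=0):
    """Illustrative Beta-Binomial posterior P(accuracy_A > accuracy_B) with
    uniform Beta(1, 1) priors; not the Bayesian analysis used in the paper."""
    rng = np.random.default_rng(seed)
    acc_a = rng.beta(1 + correct_a, 1 + total - correct_a, samples)
    acc_b = rng.beta(1 + correct_b, 1 + total - correct_b, samples)
    return float((acc_a > acc_b).mean())

# Hypothetical example: two sentences annotated with (head, deprel) per token.
gold = [[(2, "ATR"), (0, "PRED")], [(2, "SBJ"), (0, "PRED"), (2, "OBJ")]]
pred = [[(2, "ATR"), (0, "PRED")], [(1, "SBJ"), (0, "PRED"), (2, "ATR")]]
uas, las = uas_las(gold, pred)          # 0.80, 0.60
print(f"UAS={uas:.2f} LAS={las:.2f}")
print(prob_a_better(correct_a=4, correct_b=3, total=5))
```

Under these assumptions, a high UAS paired with a much lower LAS would mean heads are found but relation labels are missed, which is the kind of gap the abstract attributes to token embeddings used without a syntax-aware modeling strategy.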
Related papers
- PhiloBERTA: A Transformer-Based Cross-Lingual Analysis of Greek and Latin Lexicons [0.0]
We present PhiloBERTA, a model that measures semantic relationships between ancient Greek and Latin lexicons.
Our results show that etymologically related pairs demonstrate significantly higher similarity scores.
These findings establish a quantitative framework for examining how philosophical concepts moved between Greek and Latin traditions.
arXiv Detail & Related papers (2025-03-07T09:30:16Z) - GreekT5: A Series of Greek Sequence-to-Sequence Models for News Summarization [0.0]
This paper proposes a series of novel T5 models for Greek news articles.
The proposed models were thoroughly evaluated on the same dataset against GreekBART.
Our evaluation results reveal that most of the proposed models significantly outperform GreekBART on various evaluation metrics.
arXiv Detail & Related papers (2023-11-13T21:33:12Z) - Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation [0.0]
We use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
We build a parallel sentence dataset using a sentence-embedding alignment method to align Ancient Greek documents with English translations.
We evaluate our models on translation search, semantic similarity, and semantic retrieval tasks.
arXiv Detail & Related papers (2023-08-24T23:38:44Z) - Exploring Large Language Models for Classical Philology [17.856304057963776]
We create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages.
We evaluate all models on morphological and syntactic tasks, including lemmatization.
Results show that our models provide significant improvements over the SoTA.
arXiv Detail & Related papers (2023-05-23T05:21:02Z) - Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Few-shot Text Classification with Dual Contrastive Consistency [31.141350717029358]
In this paper, we explore how to utilize a pre-trained language model to perform few-shot text classification.
We adopt supervised contrastive learning on few labeled data and consistency-regularization on vast unlabeled data.
arXiv Detail & Related papers (2022-09-29T19:26:23Z) - Morphological Reinflection with Multiple Arguments: An Extended Annotation Schema and a Georgian Case Study [7.245355976804435]
We extend the UniMorph morphological dataset to cover verbs that agree with multiple arguments using true affixes.
The dataset has 4 times more tables and 6 times more verb forms compared to the existing UniMorph dataset.
It is expected to improve the coverage, consistency and interpretability of this benchmark.
arXiv Detail & Related papers (2022-03-16T10:47:29Z) - HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z) - Lexical semantic change for Ancient Greek and Latin [61.69697586178796]
Associating a word's correct meaning in its historical context is a central challenge in diachronic research.
We build on a recent computational approach to semantic change based on a dynamic Bayesian mixture model.
We provide a systematic comparison of dynamic Bayesian mixture models for semantic change with state-of-the-art embedding-based models.
arXiv Detail & Related papers (2021-01-22T12:04:08Z) - Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z) - Understanding Neural Abstractive Summarization Models via Uncertainty [54.37665950633147]
Seq2seq abstractive summarization models generate text in a free-form manner.
We study the entropy, or uncertainty, of the model's token-level predictions.
We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
arXiv Detail & Related papers (2020-10-15T16:57:27Z) - Temporal Embeddings and Transformer Models for Narrative Text Understanding [72.88083067388155]
We present two approaches to narrative text understanding for character relationship modelling.
The temporal evolution of these relations is described by dynamic word embeddings, which are designed to learn semantic changes over time.
A supervised learning approach based on the state-of-the-art transformer model BERT is used instead to detect static relations between characters.
arXiv Detail & Related papers (2020-03-19T14:23:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.