Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages
- URL: http://arxiv.org/abs/2506.17715v1
- Date: Sat, 21 Jun 2025 13:33:07 GMT
- Title: Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages
- Authors: Matthias Schöffel, Esteban Garces Arias, Marinus Wiedner, Paula Ruppert, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
- Abstract summary: Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines, particularly critical for historical text analysis at the intersection of computational linguistics and digital humanities. Despite significant advancements in modern large language models (LLMs) for ancient languages, their application to Medieval Romance languages presents distinctive challenges stemming from diachronic linguistic evolution, spelling variations, and labeled data scarcity. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts, spanning biblical, hagiographical, medical, and dietary domains. Through rigorous experimentation, we evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Our results reveal notable limitations in LLMs' ability to process historical language variations and non-standardized spelling, as well as promising specialized techniques that effectively address the unique challenges presented by low-resource historical languages.
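The fine-tuning approaches the abstract mentions typically cast POS tagging as token classification over a pretrained encoder. Below is a minimal sketch of that setup; the base model, tag inventory, and subword-alignment policy are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch: POS tagging as token classification with a pretrained
# encoder. Model name, tag set, and label-alignment policy are assumptions
# for illustration only.
from transformers import AutoModelForTokenClassification, AutoTokenizer

POS_TAGS = ["ADJ", "ADP", "ADV", "DET", "NOUN", "PRON", "VERB"]  # assumed UD-style subset

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # hypothetical base model
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(POS_TAGS)
)

def encode_example(words, tags):
    """Tokenize a pre-split sentence and align word-level POS tags to subwords."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev:
            labels.append(-100)  # special tokens and continuation subwords: ignored by the loss
        else:
            labels.append(POS_TAGS.index(tags[word_id]))  # first subword carries the word's tag
        prev = word_id
    enc["labels"] = labels
    return enc
```

Under this framing, the cross-lingual transfer the study examines amounts to fine-tuning such a model on a better-resourced medieval language and evaluating it on another.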
Related papers
- Two Spelling Normalization Approaches Based on Large Language Models [3.207455883863625]
Spelling normalization endeavors to align a document's orthography with contemporary standards. We propose two new approaches based on large language models: one trained without supervised training and one trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
arXiv Detail & Related papers (2025-06-29T15:25:09Z)
- Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone [4.795582035438343]
There is an urgent need for computational techniques able to adapt to the challenges of historical texts. The rise of large language models (LLMs) has revolutionized natural language processing, yet no thorough evaluation has been proposed for Italian texts.
arXiv Detail & Related papers (2025-05-26T15:16:48Z)
- Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan [0.1979158763744267]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan.
arXiv Detail & Related papers (2025-03-10T20:16:01Z)
- Adapting Multilingual Embedding Models to Historical Luxembourgish [5.474797258314828]
This study examines multilingual embeddings for cross-lingual semantic search in historical Luxembourgish. We use GPT-4o for sentence segmentation and translation, generating 20,000 parallel training sentences per language pair. We adapt several multilingual embedding models through contrastive learning or knowledge distillation and significantly increase accuracy for all models.
arXiv Detail & Related papers (2025-02-11T20:35:29Z)
- NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach [0.03187482513047917]
We show how readily available, state-of-the-art LLMs significantly outperform two leading NLP frameworks for NER in historical documents. Our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools.
arXiv Detail & Related papers (2025-02-04T16:54:23Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models. We survey the literature on techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. We study and compare a wide range of data- and parameter-efficient strategies for adapting mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements. We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from entangled pretrained cross-lingual representations. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format; a sketch of this framing applied to POS tagging follows this list.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
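The text-to-text framing in the last entry is also how prompting-based POS taggers, one of the strategies the main paper evaluates, can be set up. The following is a hedged sketch; the prompt template, tag inventory, and example sentence are invented for illustration and do not come from the paper or its corpora.

```python
# Illustrative only: casting POS tagging as a text-to-text problem.
# Template, tag set, and the example sentence are assumptions.
def pos_prompt(sentence: str) -> str:
    """Build a text-to-text style prompt asking a model to tag each word."""
    return (
        "Assign each word one part-of-speech tag "
        "(ADJ, ADP, ADV, DET, NOUN, PRON, VERB):\n"
        f"Input: {sentence}\n"
        "Output:"
    )

# A generative model completes the prompt with a tag sequence, which is then
# parsed and scored against the gold annotation.
print(pos_prompt("lo reis era mout iratz"))  # invented Old-Occitan-style sentence
```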