Related papers: Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

URL: http://arxiv.org/abs/2602.15753v1
Date: Tue, 17 Feb 2026 17:34:32 GMT
Title: Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac
Authors: Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero,
Abstract summary: Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech tagging.<n>This paper investigates the capacity of recent large language models to address these tasks in few-shot and zero-shot settings.
Score: 0.08496348835248901
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.

Related papers

Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning [0.0]
This paper presents a benchmark of several open-source Large Language Models (LLMs) for Persian Natural Language Processing tasks.<n>We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering.<n>The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms.
arXiv Detail & Related papers (2025-10-05T10:10:04Z)
Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time.<n>Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices [0.3141085922386211]
Large Language Models have significantly advanced natural language processing, achieving remarkable performance in tasks such as language generation, translation, and reasoning.<n>Their substantial computational requirements restrict deployment to high-end systems, limiting accessibility on consumer-grade devices.<n>This work presents a comprehensive evaluation of compact state-of-the-art LLMs across several essential NLP tasks tailored for Iberian languages.
arXiv Detail & Related papers (2025-04-04T09:47:58Z)
Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan [0.1979158763744267]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing.<n>This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan.
arXiv Detail & Related papers (2025-03-10T20:16:01Z)
Analysis of LLM as a grammatical feature tagger for African American English [0.6927055673104935]
African American English (AAE) presents unique challenges in natural language processing (NLP)<n>This research systematically compares the performance of available NLP models.<n>This study highlights the necessity for improved model training and architectural adjustments to better accommodate AAE's unique linguistic characteristics.
arXiv Detail & Related papers (2025-02-09T19:46:33Z)
Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization ( CLS) aims to generate a summary for the source text in a different target language.<n>Currently, instruction-tuned large language models (LLMs) excel at various English tasks.<n>Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation [24.060772057458685]
This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English. We evaluate the performance of these models across five downstream Natural Language Processing (NLP) tasks.
arXiv Detail & Related papers (2024-03-20T16:43:42Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.<n>This survey delves into an important attribute of these datasets: the dialect of a language.<n>Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks. We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.