Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data
- URL: http://arxiv.org/abs/2510.07434v1
- Date: Wed, 08 Oct 2025 18:34:00 GMT
- Title: Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data
- Authors: Olia Toporkov, Alan Akbik, Rodrigo Agerri
- Abstract summary: Lemmatization is the task of transforming all words in a given text to their dictionary forms. There is no prior evidence of how effective large language models are in the contextual lemmatization task. This paper empirically investigates the capacity of the latest generation of LLMs to perform in-context lemmatization.
- Score: 18.87770758217633
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code available upon publication: https://github.com/oltoporkov/lemma-dilemma
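The direct in-context lemma generation described above can be sketched as a few-shot prompt-construction step. This is a minimal illustration only: the instruction wording, word/lemma pair format, and example sentences are assumptions, not the authors' actual prompt.

```python
# Sketch of few-shot in-context lemmatization prompting.
# The template and examples are illustrative, not the paper's exact setup.

def build_lemmatization_prompt(examples, target_sentence):
    """Build a few-shot prompt asking an LLM to lemmatize a sentence in context.

    examples: list of (sentence, lemmas) pairs, where lemmas aligns
    one-to-one with the whitespace-split tokens of the sentence.
    """
    lines = ["Lemmatize each word in the sentence. Return word/lemma pairs."]
    for sentence, lemmas in examples:
        pairs = " ".join(f"{w}/{l}" for w, l in zip(sentence.split(), lemmas))
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Lemmas: {pairs}")
    lines.append(f"Sentence: {target_sentence}")
    lines.append("Lemmas:")
    return "\n".join(lines)

# Example usage: one in-context demonstration, then the target sentence.
examples = [("the cats were running", ["the", "cat", "be", "run"])]
prompt = build_lemmatization_prompt(examples, "she was singing")
print(prompt)
```

The returned string would then be sent to the LLM, whose completion after the final "Lemmas:" line is parsed back into per-token lemmas.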
Related papers
- Optimizing Language Models for Crosslingual Knowledge Consistency [90.86445137816942]
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function.
arXiv Detail & Related papers (2026-03-04T23:36:55Z)
- It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models [1.6407393639625105]
MHEL-LLaMo is an unsupervised ensemble approach combining a Small Language Model (SLM) and an LLM. We evaluate MHEL-LLaMo on four established benchmarks in six European languages. Results demonstrate that MHEL-LLaMo outperforms state-of-the-art models without requiring fine-tuning.
arXiv Detail & Related papers (2026-01-13T12:36:38Z)
- Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains [6.357124887141297]
Large Language Models (LLMs) have redefined Machine Translation (MT). LLMs often exhibit uneven performance across language families and specialized domains. We introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs.
arXiv Detail & Related papers (2025-10-09T07:28:30Z)
- Constrained Decoding of Diffusion LLMs with Context-Free Grammars [1.0923877073891446]
Large language models (LLMs) have shown promising performance across diverse domains. This paper presents the first constrained decoding method for diffusion models. We show that our method achieves near-perfect syntactic correctness while consistently preserving or improving functional correctness.
arXiv Detail & Related papers (2025-08-13T18:09:09Z)
- Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring [22.297388572921477]
Large language models have demonstrated exceptional performance across multiple crosslingual NLP tasks, including machine translation (MT). Persistent challenges remain in addressing context-sensitive units (CSUs), such as polysemous words. We propose a simple but effective method to enhance LLMs' MT capabilities by acquiring CSUs and applying semantic focus.
arXiv Detail & Related papers (2025-05-29T06:29:57Z)
- On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data [1.2979906794584584]
The applicability of Large Language Models (LLMs) to temporal reasoning tasks over data not seen during training remains largely unexplored. In this paper we address this topic, focusing on structured and semi-structured anonymized data. We identify and examine seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components.
arXiv Detail & Related papers (2025-04-10T10:48:42Z)
- New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression Comprehension (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. It serves as an essential testing ground for Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2025-02-27T13:58:44Z)
- PICASO: Permutation-Invariant Context Composition with State Space Models [98.91198288025117]
State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states. We propose a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating raw context tokens. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average a 5.4x speedup.
arXiv Detail & Related papers (2025-02-24T19:48:00Z)
- Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges.
These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search.
This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z)
- Adapting Large Language Models for Document-Level Machine Translation [46.370862171452444]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks.
Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning.
This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs.
arXiv Detail & Related papers (2024-01-12T09:29:13Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study across different datasets shows CoAnnotating to be an effective means of allocating work, with up to a 21% performance improvement over a random baseline.
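The general idea behind uncertainty-guided co-annotation can be sketched as routing each item to a human annotator when the model's predictive uncertainty is high, and to the LLM otherwise. The entropy measure, threshold value, and example probabilities below are illustrative assumptions, not CoAnnotating's exact allocation criterion.

```python
import math

# Sketch of uncertainty-guided work allocation between an LLM and humans:
# items where the model's predictive entropy exceeds a threshold are routed
# to human annotators; the rest are annotated by the LLM.

def entropy(probs):
    """Shannon entropy (nats) of a discrete label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate(items, threshold=0.5):
    """items: list of (text, label_probs). Returns (llm_batch, human_batch)."""
    llm_batch, human_batch = [], []
    for text, probs in items:
        target = human_batch if entropy(probs) > threshold else llm_batch
        target.append(text)
    return llm_batch, human_batch

# Example: a confident prediction stays with the LLM, an ambiguous one
# (entropy ln 2 ≈ 0.69 > 0.5) goes to a human.
items = [
    ("clear positive review", [0.95, 0.05]),
    ("ambiguous sarcasm", [0.5, 0.5]),
]
llm_batch, human_batch = allocate(items)
print(llm_batch, human_batch)
```

In practice the label distribution would come from the LLM itself (e.g. by sampling multiple annotations per item), and the threshold would be tuned against an annotation budget.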
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.