Related papers: Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs

URL: http://arxiv.org/abs/2510.03762v1
Date: Sat, 04 Oct 2025 10:07:14 GMT
Title: Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs
Authors: Deshan Sumanathilaka, Nicholas Micallef, Julian Hough
Abstract summary: This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in multilingual languages, but this issue does not appear in English.
Score: 3.925313161884993
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its practicality and effectiveness. This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task, particularly focusing on the biases introduced by imbalanced sample distributions. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages: English, German, Spanish, French, and Italian. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in multilingual languages, but this issue does not appear in English. To assess model behavior, we evaluate both the GPT-4o and LLaMA-3.1-70B models and the results highlight the sensitivity of multilingual WSD to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.
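The abstract's call for balanced few-shot examples can be illustrated with a minimal Python sketch. This is not the GlossGPT method or the authors' code: the function names `balanced_few_shot` and `imbalance_ratio`, the `(sentence, sense_label)` pool format, and the sense labels are all hypothetical, chosen only to show one way of drawing the same number of examples per sense before building a prompt.

```python
import random
from collections import Counter, defaultdict


def balanced_few_shot(pool, per_sense, seed=0):
    """Pick up to `per_sense` examples for each sense label in `pool`.

    `pool` is a list of (sentence, sense_label) pairs. This format is an
    assumption made for illustration, not something taken from the paper.
    """
    rng = random.Random(seed)
    by_sense = defaultdict(list)
    for sentence, sense in pool:
        by_sense[sense].append(sentence)

    selected = []
    for sense in sorted(by_sense):
        sentences = by_sense[sense]
        k = min(per_sense, len(sentences))
        for sentence in rng.sample(sentences, k):
            selected.append((sentence, sense))

    # Shuffle so that examples of one sense are not grouped together.
    rng.shuffle(selected)
    return selected


def imbalance_ratio(examples):
    """Ratio of the most frequent sense count to the least frequent one."""
    counts = Counter(sense for _, sense in examples)
    return max(counts.values()) / min(counts.values())


# Hypothetical example pool for the ambiguous word "bank".
pool = [
    ("She sat on the bank of the river.", "bank%river"),
    ("He deposited cash at the bank.", "bank%finance"),
    ("The bank approved the loan.", "bank%finance"),
    ("The bank raised its interest rates.", "bank%finance"),
    ("Reeds grew along the bank.", "bank%river"),
]

shots = balanced_few_shot(pool, per_sense=2)
print(imbalance_ratio(pool), imbalance_ratio(shots))
```

In this sketch a ratio of 1.0 means every sense appears equally often among the selected examples; how much imbalance a given model tolerates in each language is the empirical question the paper studies.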

Related papers

Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z)
Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data [2.812898346527047]
This study investigates the capabilities of large language models (LLMs) for zero-shot and few-shot annotation of social media posts in Russian and Ukrainian. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels. The study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability.
arXiv Detail & Related papers (2025-05-15T13:10:47Z)
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [86.98098988779809]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings. We train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement [22.992484902761994]
This study systematically evaluates the performance of multiple Large Language Models (LLMs) in detecting offensive language. We analyze binary classification accuracy, examine the relationship between model confidence and human disagreement, and explore how disagreement samples influence model decision-making.
arXiv Detail & Related papers (2025-02-10T07:14:26Z)
A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context [0.9074663948713616]
Mental health disorders pose a growing public health concern in the Arab world. This study comprehensively evaluates 8 large language models (LLMs) on diverse mental health datasets.
arXiv Detail & Related papers (2025-01-12T16:17:25Z)
SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment [78.4550589538805]
We propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, only tunes 6 layers' feed-forward sub-layers including 6.5-8% of all parameters within 7B and 13B LLMs.
arXiv Detail & Related papers (2025-01-07T10:29:43Z)
Counterfactual Samples Constructing and Training for Commonsense Statements Estimation [17.970740197590693]
Plausibility Estimation plays a crucial role for enabling language models to objectively comprehend the real world. They lack two key traits of an ideal PE model: language-explainable and commonsense-sensitive. We propose a novel model-agnostic method, referred to as Commonsense Counterfactual Samples Generating.
arXiv Detail & Related papers (2024-12-29T20:18:52Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis [23.757767581876063]
In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations. We show that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations.
arXiv Detail & Related papers (2024-02-20T12:53:31Z)
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference [38.1823640848362]
State-of-the-art generative large language models (LLMs) disproportionately rely on English-centric tokenizers, vocabulary and pre-training data. Recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English.
arXiv Detail & Related papers (2024-02-16T14:15:15Z)
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format. BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages. We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.