Related papers: A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs

URL: http://arxiv.org/abs/2505.13173v2
Date: Sat, 31 May 2025 12:29:42 GMT
Title: A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs
Authors: V. S. D. S. Mahesh Akavarapu, Hrishikesh Terdalkar, Pramit Bhattacharyya, Shubhangi Agarwal, Vishakha Deulgaonkar, Pralay Manna, Chaitali Dangarikar, Arnab Bhattacharya
Abstract summary: We focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek and Latin. First, we explore named entity recognition and machine translation into English. We show that incorporating context via a retrieval-augmented generation approach significantly boosts performance.
Score: 3.4020284996081216
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek and Latin -- to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform on par with or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via a retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest that model scale is an important factor influencing cross-lingual generalization. Assuming that the models used, such as GPT-4o and Llama-3.1, are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.

Related papers

IMPACT: Inflectional Morphology Probes Across Complex Typologies [0.0]
IMPACT is a synthetically generated evaluation framework focused on inflectional morphology. It is designed to evaluate performance across five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. We assess eight multilingual LLMs that, despite strong English performance, struggle with other languages and uncommon morphological patterns.
arXiv Detail & Related papers (2025-06-30T14:58:23Z)
Under the Shadow of Babel: How Language Shapes Reasoning in LLMs [27.48119976373105]
We show that large language models internalize the habitual logical structures embedded in different languages. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English.
arXiv Detail & Related papers (2025-06-19T09:06:38Z)
Language Surgery in Multilingual Large Language Models [32.77326546076424]
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers. We propose Inference-Time Language Control (ITLC) to enable precise cross-lingual language control and mitigate language confusion.
arXiv Detail & Related papers (2025-06-14T11:09:50Z)
The Emergence of Abstract Thought in Large Language Models Beyond Any Language [95.50197866832772]
Large language models (LLMs) function effectively across a diverse range of languages. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. Recent results show strong multilingual performance, even surpassing English performance on specific tasks in other languages.
arXiv Detail & Related papers (2025-06-11T16:00:54Z)
Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Model (LLM) reasoning tasks. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z)
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs [8.146860674148044]
We attempt to measure models' language understanding capacity while circumventing the risk of dataset recall. We parameterize large families of language tasks recognized by deterministic finite automata (DFAs). We find that, even in the strikingly simple setting of 3-state DFAs, LLMs underperform unparameterized ngram models on both language recognition and synthesis tasks.
arXiv Detail & Related papers (2025-01-06T07:57:51Z)
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We present the first study aimed at objectively assessing the fairness and robustness of Large Language Models (LLMs) in handling dialects in canonical reasoning tasks. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
Unveiling the Generalization Power of Fine-Tuned Large Language Models [81.70754292058258]
We investigate whether fine-tuning affects the generalization ability intrinsic to Large Language Models (LLMs). Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks. We observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability.
arXiv Detail & Related papers (2024-03-14T08:18:59Z)
Unveiling Linguistic Regions in Large Language Models [49.298360366468934]
Large Language Models (LLMs) have demonstrated considerable cross-lingual alignment and generalization ability. This paper conducts several investigations on the linguistic competence of LLMs.
arXiv Detail & Related papers (2024-02-22T16:56:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.