Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
- URL: http://arxiv.org/abs/2602.02287v1
- Date: Mon, 02 Feb 2026 16:27:32 GMT
- Title: Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
- Authors: Isaac Chung, Linda Freienthal,
- Abstract summary: Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages.
- Score: 0.22009842278462158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
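As a rough illustration of the stability probe described in the abstract, the sketch below checks whether a judge induces the same ranking of systems in each language. The scores are hypothetical placeholders, not the paper's data, and the released framework in the repository above may be organized quite differently.

```python
# Minimal sketch of the stability diagnostic: with generation held fixed,
# score the same systems with a judge in each target language and test
# whether the induced model rankings agree across languages.
from scipy.stats import kendalltau, spearmanr

# Judge scores for the same four systems per language (hypothetical values).
judge_scores = {
    "et": [4.1, 3.6, 3.9, 2.8],
    "fi": [3.2, 4.0, 3.1, 3.5],
    "hu": [3.8, 3.7, 4.2, 3.0],
}

languages = list(judge_scores)
for i, a in enumerate(languages):
    for b in languages[i + 1:]:
        tau, _ = kendalltau(judge_scores[a], judge_scores[b])
        rho, _ = spearmanr(judge_scores[a], judge_scores[b])
        # Near-zero or negative rank correlations flag judge-transfer
        # failure, since generation conditions are identical by design.
        print(f"{a} vs {b}: Kendall tau={tau:.2f}, Spearman rho={rho:.2f}")
```

Per the abstract, surface-level metrics keep such correlations high across these three languages; it is the pragmatic judgments (coherence, instruction-following) that exhibit rank inversions and near-zero correlations.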
Related papers
- Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation [1.9102169745315323]
Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference. This limitation constrains both model training and evaluation. We introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references.
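The summary does not spell out BLEUpara's exact formulation; the sketch below only shows the standard multi-reference scoring that sacreBLEU already supports, which is the mechanism such an extension builds on.

```python
# Scoring hypotheses against multiple reference sets with sacreBLEU.
# Each inner list is one complete reference corpus aligned with the
# hypotheses; paraphrased references simply form additional sets.
import sacrebleu

hypotheses = ["the weather is nice today", "she closed the window"]
references = [
    ["the weather is good today", "she shut the window"],         # originals
    ["today the weather is pleasant", "she closed the windows"],  # paraphrases
]

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"multi-reference BLEU: {score.score:.1f}")
```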
arXiv Detail & Related papers (2026-01-29T00:02:19Z) - Benchmarking Concept-Spilling Across Languages in LLMs [7.577675422356702]
Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often show a systematic bias toward representations from other languages. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by measuring how models handle polysemous words across languages.
arXiv Detail & Related papers (2026-01-18T19:28:26Z) - On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation [88.77441715819366]
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content. We propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity.
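For context, the naive global token perplexity this entry argues against is just the exponentiated mean negative log-likelihood over all tokens; a minimal sketch, with hypothetical token log-probabilities:

```python
# Global token perplexity: exp of the mean negative log-likelihood over
# every token, with no distinction between content-bearing and other
# tokens. Log-probabilities below are hypothetical placeholders.
import math

token_logprobs = [-2.3, -0.7, -1.9, -4.1, -0.2, -3.3]  # log p(token | prefix)

nll = -sum(token_logprobs) / len(token_logprobs)
print(f"global token perplexity: {math.exp(nll):.2f}")
```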
arXiv Detail & Related papers (2026-01-09T22:01:56Z) - Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation [49.2073409243885]
Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. We conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. We identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages.
arXiv Detail & Related papers (2026-01-01T08:53:49Z) - Conditions for Catastrophic Forgetting in Multilingual Translation [24.10629800866219]
We identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. We show that the relative scale between model and data size is a primary determinant of forgetting. We also show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
arXiv Detail & Related papers (2025-10-22T12:54:00Z) - Does Language Model Understand Language? [1.0450509067356148]
Despite advances in natural language generation and understanding, language models still struggle with fine-grained linguistic phenomena. In this study, we conduct an evaluation of state-of-the-art language models across challenging contexts in both English and Bengali. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions.
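The two quantities this entry reports, correlation and MAE against reference ratings, are straightforward to compute; a sketch with hypothetical ratings:

```python
# Correlation and mean absolute error between model scores and human
# ratings. All values below are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr

human = np.array([4.0, 2.5, 3.5, 5.0, 1.5])
model = np.array([3.8, 2.9, 3.2, 4.6, 2.0])

r, _ = pearsonr(human, model)
mae = float(np.mean(np.abs(human - model)))
print(f"Pearson r={r:.2f}, MAE={mae:.2f}")
```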
arXiv Detail & Related papers (2025-09-15T21:09:09Z) - SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation [29.960223851833785]
We propose SiLVERScore, a semantically-aware embedding-based evaluation metric for sign language generation. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs.
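SiLVERScore's actual encoder is not described in this summary; the sketch below uses an off-the-shelf sentence encoder purely as a stand-in to show the discrimination test the entry refers to, where a correct pair should score well above a random one.

```python
# Embedding-based discrimination test: cosine similarity of a reference
# against a correct output vs. a random one. The encoder is a stand-in,
# not the model used by SiLVERScore.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

encoder = SentenceTransformer("all-MiniLM-L6-v2")

reference = "the train to the airport leaves at nine"
correct = "the airport train departs at nine o'clock"
random_text = "bake the bread for forty minutes"

ref_emb, cor_emb, rnd_emb = encoder.encode([reference, correct, random_text])
print("correct pair:", float(cos_sim(ref_emb, cor_emb)))
print("random pair: ", float(cos_sim(ref_emb, rnd_emb)))
```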
arXiv Detail & Related papers (2025-09-04T00:58:43Z) - Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z) - Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [56.61984030508691]
We present the first mechanistic interpretability study of language confusion. We show that confusion points (CPs) are central to this phenomenon. We show that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
arXiv Detail & Related papers (2025-05-22T11:29:17Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation [86.19634542434711]
Mauve measures an information-theoretic divergence between two probability distributions over strings.
We show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance.
We conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
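As a rough sketch of the cluster-based substitute discussed here: quantize text embeddings with k-means and compare the resulting cluster histograms with a divergence. The random embeddings below are stand-ins for real features, and Mauve's actual construction differs in detail.

```python
# Cluster-based comparison of two text distributions: quantize embeddings
# with k-means, then compare cluster histograms with a smoothed KL
# divergence. Embeddings here are random stand-ins for real features.
import numpy as np
from scipy.special import rel_entr
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
human_emb = rng.normal(0.0, 1.0, size=(200, 32))
model_emb = rng.normal(0.2, 1.1, size=(200, 32))

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
kmeans.fit(np.vstack([human_emb, model_emb]))

def cluster_histogram(embs, eps=1e-6):
    counts = np.bincount(kmeans.predict(embs), minlength=8) + eps
    return counts / counts.sum()

p = cluster_histogram(human_emb)
q = cluster_histogram(model_emb)
print(f"KL(human || model) over clusters: {rel_entr(p, q).sum():.3f}")
```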
arXiv Detail & Related papers (2022-05-31T17:58:49Z) - Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.