Translation as a Scalable Proxy for Multilingual Evaluation
- URL: http://arxiv.org/abs/2601.11778v1
- Date: Fri, 16 Jan 2026 21:01:40 GMT
- Title: Translation as a Scalable Proxy for Multilingual Evaluation
- Authors: Sheriff Issaka, Erick Rosas Gonzalez, Lieqi Liu, Evans Kofi Agyei, Lucas Bandarkar, Nanyun Peng, David Ifeoluwa Adelani, Francisco Guzmán, Saadia Gabriel
- Abstract summary: We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? We find that translation performance is a good indicator of downstream task success.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs claim multilingual proficiency, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving >98% of the world's 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? Through systematic evaluation of 14 models (1B-72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a good indicator of downstream task success (e.g., Phi-4, median Pearson r: MetricX = 0.89, xCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful translation overlap with those required for multilingual understanding. Translation quality thus emerges as a strong, inexpensive first-pass proxy for multilingual performance, enabling translation-first screening with targeted follow-up for specific tasks.
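The proxy analysis described in the abstract boils down to correlating a model's per-language translation scores with its per-language downstream task scores, then summarizing across benchmarks (the paper reports median Pearson r). A minimal sketch of that computation, with invented per-language numbers (the scores below are hypothetical, not taken from the paper):

```python
# Sketch of the translation-as-proxy screening idea: correlate a model's
# translation-metric scores with its downstream task scores across languages.
# All numbers here are invented for illustration.
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# Hypothetical per-language scores for one model on one benchmark:
# a translation metric (0-1, xCOMET-style) vs. downstream task accuracy.
translation_quality = [0.91, 0.78, 0.85, 0.60, 0.72]
task_accuracy       = [0.84, 0.70, 0.80, 0.55, 0.66]

r = pearson_r(translation_quality, task_accuracy)
print(f"Pearson r = {r:.2f}")  # a high r supports translation as a proxy
```

In the paper's setting this would be repeated per benchmark and per metric, with the median r reported per model.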
Related papers
- DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation [31.1561882673283]
DITING is the first comprehensive evaluation framework for web novel translation. AgentEval simulates expert deliberation to assess translation quality beyond lexical overlap. We develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores.
arXiv Detail & Related papers (2025-10-10T08:10:10Z)
- Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification [35.35733615199578]
We compare translation-based and language-specific/multilingual classification pipelines. We find that translation benefits are strongly correlated with both the resource level of the target language and the quality of the machine translation system.
arXiv Detail & Related papers (2025-09-17T23:58:07Z)
- When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification [14.187153195380668]
Large language models have remarkable capabilities across many NLP tasks, but their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We evaluate five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Surprisingly, we find that XLM-R substantially outperforms all tested LLMs, achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%.
arXiv Detail & Related papers (2025-07-28T10:49:04Z)
- Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters [53.59868121093848]
We introduce Seed-X, a family of open-source large language models (LLMs) with 7B parameters. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages. The instruct model is then finetuned to translate via Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs.
arXiv Detail & Related papers (2025-07-18T03:19:43Z)
- Duluth at SemEval-2025 Task 7: TF-IDF with Optimized Vector Dimensions for Multilingual Fact-Checked Claim Retrieval [0.0]
This paper presents the Duluth approach to SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implement a TF-IDF-based retrieval system and experiment with vector dimensions and tokenization strategies.
arXiv Detail & Related papers (2025-05-19T01:58:22Z)
- LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
We propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of large language models (LLMs). The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately. The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.