Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
- URL: http://arxiv.org/abs/2602.22207v1
- Date: Wed, 25 Feb 2026 18:58:25 GMT
- Title: Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
- Authors: Hanna Yukhymenko, Anton Alexandrov, Martin Vechev,
- Abstract summary: We present a fully automated framework designed to enable scalable, high-quality translation of datasets and benchmarks.<n>We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages.
- Score: 2.0199251985015434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
Related papers
- BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR [0.06363400715351396]
This work presents a Bangla IR dataset constructed using a BETA-labeling framework.<n>We examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation.
arXiv Detail & Related papers (2026-02-16T06:04:04Z) - Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation [49.82863380286994]
In-context learning may offer novel ways to adapt Large Language Models for low-resource machine translation.<n>In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models.<n>Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window.
arXiv Detail & Related papers (2026-02-04T17:02:22Z) - Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection [0.15229257192293197]
Low-resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources.<n>Recent advances in large language models suggest that self-reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency.<n>This paper introduces Reflective Translation, a prompt-based framework in which a model generates an initial translation, produces a structured self-critique, and then uses this reflection to generate a refined translation.
arXiv Detail & Related papers (2026-01-27T18:37:09Z) - Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis [4.061135251278187]
TrGLUE is a benchmark for evaluating natural language understanding in Turkish.<n>We also present SentiTurca, a specialized benchmark for sentiment analysis.<n> TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations.
arXiv Detail & Related papers (2025-12-26T18:02:09Z) - Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation [4.45354703148321]
Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance.<n>We present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation.
arXiv Detail & Related papers (2025-11-28T12:40:30Z) - Structured Prompting Enables More Robust Evaluation of Language Models [38.53918044830268]
We present a DSPy+HELM framework that introduces structured prompting methods which elicit reasoning.<n>We find that without structured prompting, HELM underestimates LM performance (by 4% average) and performance estimates vary more across benchmarks.<n>This is the first benchmarking study to systematically integrate structured prompting into an established evaluation framework.
arXiv Detail & Related papers (2025-11-25T20:37:59Z) - LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark [1.3927943269211591]
We propose a comprehensive framework that enhances Large Language Models (LLMs)-based machine translation evaluation.<n>We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers.<n>Our evaluation shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation.
arXiv Detail & Related papers (2025-05-18T07:24:13Z) - Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data [18.150384435635477]
In many settings, monolingual corpora in the target language are often available.<n>We design improved cross-lingual retrieval systems, trained with both sentence level and word-level matching objectives.<n>We also showcase our method on a real-world settings, using much larger monolingual and observe strong improvements over both the baseline setting and general-purpose cross-lingual retrievers.
arXiv Detail & Related papers (2025-04-30T15:41:03Z) - Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data [64.4458540273004]
We propose a self-play framework that leverages only monolingual data and the intrinsic multilingual knowledge of Large Language Models (LLMs)<n>Experiments demonstrate that this approach not only matches the performance of models trained on large-scale parallel data but also excels in non-English translation directions.
arXiv Detail & Related papers (2025-04-20T16:20:30Z) - RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [50.65321080814249]
RustRepoTrans is the first repository-level context code translation benchmark targeting incremental translation.<n>We evaluate seven representative LLMs, analyzing their errors to assess limitations in complex translation scenarios.
arXiv Detail & Related papers (2024-11-21T10:00:52Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST)
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - Improving Multilingual Translation by Representation and Gradient
Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.