Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages
- URL: http://arxiv.org/abs/2602.01162v1
- Date: Sun, 01 Feb 2026 11:22:30 GMT
- Title: Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages
- Authors: Nipuna Abeykoon, Ashen Weerathunga, Pubudu Wijesinghe, Parameswari Krishnamurthy
- Abstract summary: Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining.
- Score: 1.0732935873226022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs shows that intervention rates correlate strongly with typological distance from English. In experiments on 341 English sentences, each exhibiting distinct morphological and syntactic phenomena, the framework achieves an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.
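At its core, the selection component amounts to scoring each candidate translation against a target-language typological profile and keeping the best-scoring one. The sketch below is a minimal illustration of that idea under stated assumptions: the dimension names, the simple 2-vs-1 divergence weighting, and all function names are hypothetical stand-ins for illustration, not the paper's actual UMF profiles or scoring formula.

```python
from typing import Callable

# A typological profile maps dimension names to expected values,
# e.g. {"word_order": "SOV", "adposition": "postposition"}.
Profile = dict[str, str]
# A checker decides whether a candidate sentence conforms to the
# expected value of one dimension (implementation is language-specific).
Checker = Callable[[str, str], bool]

def divergence_weights(source: Profile, target: Profile) -> dict[str, float]:
    """Up-weight dimensions where the target diverges from the source
    (English), since those are where LLM candidates most often drift."""
    return {d: 2.0 if source.get(d) != v else 1.0 for d, v in target.items()}

def compliance_score(candidate: str, target: Profile,
                     weights: dict[str, float],
                     checkers: dict[str, Checker]) -> float:
    """Weighted fraction of checked dimensions the candidate satisfies."""
    total = sum(weights[d] for d in checkers)
    hits = sum(weights[d] for d, check in checkers.items()
               if check(candidate, target[d]))
    return hits / total if total else 0.0

def rerank(candidates: list[str], source: Profile, target: Profile,
           checkers: dict[str, Checker]) -> str:
    """Return the candidate most compliant with the target profile."""
    weights = divergence_weights(source, target)
    return max(candidates,
               key=lambda c: compliance_score(c, target, weights, checkers))
```

Because scoring consumes only the candidate strings, the same loop can wrap any LLM that emits multiple sampled translations, consistent with the abstract's claim of model-agnostic deployment.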
Related papers
- Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models [11.604740935992147]
Typological properties have been proposed to determine the intrinsic difficulty of modeling a language. We analyze two large pre-trained multilingual translation models, NLLB-200 and Tower+. Based on a broad set of languages, we find that target-language typology drives the translation quality of both models.
arXiv Detail & Related papers (2026-02-03T14:02:06Z)
- MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages [33.450081592217074]
We introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage.
arXiv Detail & Related papers (2025-06-24T09:53:00Z)
- Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages [9.426642998924724]
Multilingual large language models generate non-faithful output in resource-constrained languages. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. We show that monolingual 4-layer BERT models, pretrained from scratch on less than 700 MB of data and used without fine-tuning, identify faithful summaries with a mean accuracy of 88.33%. (A minimal rescoring sketch follows this list.)
arXiv Detail & Related papers (2025-02-24T21:22:19Z)
- A Principled Framework for Evaluating on Typologically Diverse Languages [11.512231869456512]
We present a language sampling framework for selecting highly typologically diverse languages. We find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP.
arXiv Detail & Related papers (2024-07-06T09:31:02Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for conditionally encoding instances.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that performance suffers because the representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- An Empirical Study of Factors Affecting Language-Independent Models [11.976665726887733]
We show that language-independent models can be comparable to, or even outperform, models trained using monolingual data. We experiment with language-independent models across many different languages and show that they are more suitable for typologically similar languages.
arXiv Detail & Related papers (2019-12-30T22:41:57Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages together. Learning joint character representations across multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
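As flagged in the "Towards Typologically Aware Rescoring" entry above, several papers in this list rescore the outputs of large models with a lightweight auxiliary model. Below is a hedged sketch of one common realization of that idea: pseudo-log-likelihood scoring with a small masked language model via Hugging Face transformers. The checkpoint name `small-monolingual-bert` is a placeholder, and this scoring scheme is an assumption for illustration, not that paper's exact method.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder checkpoint: stands in for a small monolingual BERT
# pretrained on the target language (not a real model ID).
CKPT = "small-monolingual-bert"
tok = AutoTokenizer.from_pretrained(CKPT)
mlm = AutoModelForMaskedLM.from_pretrained(CKPT)
mlm.eval()

def pseudo_log_likelihood(text: str) -> float:
    """Mask each token in turn and average the log-probability the
    small model assigns to the true token (length-normalized)."""
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS]/[SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / max(len(ids) - 2, 1)

# Keep the candidate the auxiliary model finds most natural.
candidates = ["...translation A...", "...translation B..."]
best = max(candidates, key=pseudo_log_likelihood)
```

The design choice here mirrors the listed papers' motivation: the large model proposes, while a cheap target-language model, trainable on well under 1 GB of text, disposes.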
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.