Related papers: Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models

Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models

URL: http://arxiv.org/abs/2602.03551v1
Date: Tue, 03 Feb 2026 14:02:06 GMT
Title: Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models
Authors: Vitalii Hirak, Jaap Jumelet, Arianna Bisazza,
Abstract summary: typological properties have been proposed to determine the intrinsic difficulty of modeling a language.<n>We analyze two large pre-trained multilingual translation models, NLLB-200 and Tower+.<n>Based on a broad set of languages, we find that target language typology drives translation quality of both models.
Score: 11.604740935992147
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.

Related papers

Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning [0.6599344783327054]
We study how training on machine-translated data affects small English language models.<n>We train models on English text translated from 24 typologically and resource-diverse source languages.
arXiv Detail & Related papers (2026-02-18T13:59:08Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Language Embeddings Sometimes Contain Typological Generalizations [0.0]
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z)
HiJoNLP at SemEval-2022 Task 2: Detecting Idiomaticity of Multiword Expressions using Multilingual Pretrained Language Models [0.6091702876917281]
This paper describes an approach to detect idiomaticity only from the contextualized representation of a MWE over multilingual pretrained language models. Our experiments find that larger models are usually more effective in idiomaticity detection. However, using a higher layer of the model may not guarantee a better performance.
arXiv Detail & Related papers (2022-05-27T01:55:59Z)
Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis. We cluster all the target languages into multiple groups and name each group as a representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages. We infer this distribution from a sample of typologically diverse training languages. We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers. We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source. We observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
An Empirical Study of Factors Affecting Language-Independent Models [11.976665726887733]
We show that language-independent models can be comparable to or even outperforms the models trained using monolingual data. We experiment language-independent models with many different languages and show that they are more suitable for typologically similar languages.
arXiv Detail & Related papers (2019-12-30T22:41:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.