Related papers: Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

URL: http://arxiv.org/abs/2508.09091v3
Date: Sat, 08 Nov 2025 09:14:23 GMT
Title: Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages
Authors: Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan,
Abstract summary: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training.<n>We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM.<n>We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam.
Score: 4.702593857707973
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM's embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.

Related papers

Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages [9.066355705304984]
We propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages.<n>By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder.<n>Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks.
arXiv Detail & Related papers (2025-02-15T16:53:10Z)
Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages [10.418542753869433]
Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data.<n>Current state-of-the-art large language models (LLMs) still struggle with LRLs.<n>Small multilingual models (mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of their capacity to low training data sizes.
arXiv Detail & Related papers (2025-02-14T13:10:39Z)
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.<n>For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.<n>We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? [4.616570111453259]
Large language models (LLMs) exhibit state-of-the-art performance in various tasks, but struggle for code translation. We conduct a large-scale empirical study to exploit the capabilities and incapabilities of LLMs in code translation tasks. We propose two methods: (1) intermediary translation which selects an intermediary language between the source and target ones; and (2) self-training which fine-tunes LLMs on self-generated parallel data.
arXiv Detail & Related papers (2024-10-13T12:20:12Z)
LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages [36.52198103816494]
Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks. But their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. We conduct extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages.
arXiv Detail & Related papers (2024-07-08T14:18:28Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.<n>But can these models relate corresponding concepts across languages, i.e., be crosslingual?<n>This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation [104.10726545151043]
multilingual data has been found more beneficial for NMT models that translate from the LRL to a target language than the ones that translate into the LRLs. Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
arXiv Detail & Related papers (2020-10-04T19:42:40Z)
FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. During inference, the model makes predictions based on the text input in the target language and its translation in the source language. To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.