Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages
- URL: http://arxiv.org/abs/2511.06497v1
- Date: Sun, 09 Nov 2025 18:54:17 GMT
- Title: Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages
- Authors: Quang Phuoc Nguyen, David Anugraha, Felix Gaschi, Jun Bin Cheng, En-Shiun Annie Lee
- Abstract summary: Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. We investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer.
- Score: 4.005879928038127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word realignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer, and study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.
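The listing does not include the training recipe, but realignment objectives in prior work typically pull word-aligned token representations from parallel sentences closer together. Below is a minimal, hedged sketch of such a contrastive word-level realignment loss together with the kind of language-subset restriction the abstract studies; the function names, the InfoNCE form of the loss, and the example language subset are assumptions, not the authors' exact method.

```python
# Hypothetical sketch of word-level realignment: pull aligned token embeddings
# from two languages together with an InfoNCE-style contrastive loss, computed
# only over a strategically selected subset of languages. The loss form, names,
# and the example subset are assumptions; the paper's exact setup may differ.
import torch
import torch.nn.functional as F

def realignment_loss(src_vecs: torch.Tensor, tgt_vecs: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """src_vecs, tgt_vecs: (n_pairs, hidden) embeddings of word-aligned tokens."""
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    logits = src @ tgt.T / temperature        # similarity of every src token to every tgt token
    labels = torch.arange(src.size(0))        # the i-th src token aligns with the i-th tgt token
    # symmetric InfoNCE: each side must identify its aligned partner among in-batch negatives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Language-subset selection (the paper's focus): realign on a small,
# linguistically diverse subset rather than on all available languages.
selected_languages = ["sw", "tr", "ar", "zh"]  # illustrative subset, not from the paper
```

A realignment step would add this loss, computed on parallel word pairs for the selected languages, on top of the model's usual training objective.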
Related papers
- Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation [73.54930910609328]
We propose LcRL, a multilingual search-augmented reinforcement learning framework. LcRL integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt language-coupled group sampling in the rollout module to reduce knowledge bias, and add an auxiliary anti-consistency penalty to the reward models to mitigate knowledge conflict.
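As background for the GRPO variant named above, the sketch below shows only the generic group-relative advantage computation that GRPO builds on; how LcRL couples languages within a group and shapes the anti-consistency penalty is not described in enough detail here to reproduce, so nothing beyond the standard step is assumed.

```python
# Generic GRPO building block: standardize rewards within each sampled group so
# every rollout is scored relative to its group-mates. LcRL's language-coupled
# sampling and anti-consistency penalty are not reproduced here.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size) scalar rewards for sampled rollouts."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)     # advantage of each rollout within its group
```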
arXiv Detail & Related papers (2026-01-21T11:32:32Z) - TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B [4.282981703665803]
We investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from an LRL into a high-resource language (HRL). Our results show that aligning mid-level layers using TRepLiNa is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
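Since the method name references CKA, a minimal sketch of linear Centered Kernel Alignment between hidden states at one layer is given below; the variable names and the way the score would be turned into a loss are assumptions, and TRepLiNa's exact formulation (including the REPINA regularizer) may differ.

```python
# Linear CKA between (n_tokens, hidden) representations of parallel LRL/HRL
# inputs at one layer; 1 - CKA could serve as an auxiliary alignment loss at a
# chosen mid-level layer. Generic sketch, not TRepLiNa's exact code.
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    x = x - x.mean(dim=0, keepdim=True)       # center features
    y = y - y.mean(dim=0, keepdim=True)
    cross = y.T @ x
    num = (cross ** 2).sum()                  # ||Y^T X||_F^2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return num / den
```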
arXiv Detail & Related papers (2025-10-03T17:36:12Z) - Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning [18.564382777331932]
We show that demonstrations in mixed HRLs consistently outperform English-only ones. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains.
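To make the setting concrete, a toy sketch of constructing a mixed-HRL few-shot prompt (as opposed to English-only demonstrations) is shown below; the prompt template, task, and demonstration sentences are invented for illustration and are not taken from the paper.

```python
# Toy sketch: build a few-shot classification prompt whose demonstrations mix
# several high-resource languages instead of using English only.
def build_prompt(demos: list[tuple[str, str]], query: str) -> str:
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

mixed_hrl_demos = [
    ("Das Essen war hervorragend.", "positive"),   # German
    ("La película fue aburrida.", "negative"),     # Spanish
    ("Le service était correct.", "neutral"),      # French
]
print(build_prompt(mixed_hrl_demos, "Huduma ilikuwa nzuri sana."))  # Swahili query
```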
arXiv Detail & Related papers (2025-02-17T02:27:35Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. However, recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that instruction-tuning LLMs on question translation data (i.e., without annotated answers) encourages alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z) - Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
However, their performance in most languages still lags behind that of a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z) - Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
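A hedged sketch of one way zero-shot pointwise reranking with an LLM can be set up in this scenario (English query, African-language passages) follows; the prompt wording, the scoring scheme, and the `score_fn` callable are assumptions, not the paper's implementation.

```python
# Hypothetical pointwise reranking: prompt an LLM once per passage and sort by
# the returned relevance score. Prompt wording and scoring are assumptions.
from typing import Callable

def rerank(query: str, passages: list[str],
           score_fn: Callable[[str], float]) -> list[str]:
    scored = []
    for passage in passages:
        prompt = (
            f"Query (English): {query}\n"
            f"Passage: {passage}\n"
            "Is the passage relevant to the query? Answer yes or no."
        )
        scored.append((score_fn(prompt), passage))  # e.g. the LLM's probability of "yes"
    return [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)]
```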
arXiv Detail & Related papers (2023-12-26T18:38:54Z) - CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages [22.51558549091902]
We address the task of machine translation (MT) from an extremely low-resource language (ELRL) to English by leveraging cross-lingual transfer from a 'closely-related' high-resource language (HRL).
Many ELRLs share lexical similarities with some HRLs, which presents a novel modeling opportunity.
Existing subword-based neural MT models do not explicitly harness this lexical similarity, as they only implicitly align the HRL and ELRL latent embedding spaces.
We propose CharSpan, a novel approach based on 'character-span noise augmentation' applied to the training data of the HRL.
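The listing truncates the abstract, so the sketch below only illustrates the general idea of character-span noise augmentation on HRL training text; the span length, noise rate, and placeholder character are illustrative guesses rather than the paper's recipe.

```python
# Illustrative character-span corruption of HRL training sentences, intended to
# make the model robust to small spelling divergences between the HRL and a
# lexically similar ELRL. Hyperparameters here are guesses, not CharSpan's.
import random

def charspan_noise(sentence: str, noise_rate: float = 0.1,
                   max_span: int = 3, mask_char: str = "_") -> str:
    chars = list(sentence)
    out, i = [], 0
    while i < len(chars):
        if chars[i] != " " and random.random() < noise_rate:
            out.append(mask_char)                 # replace a short character span with noise
            i += random.randint(1, max_span)
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

print(charspan_noise("the quick brown fox jumps over the lazy dog"))
```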
arXiv Detail & Related papers (2023-05-09T07:23:01Z) - Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages [18.862296065737347]
We argue that relatedness among languages in a language family along the dimension of lexical overlap may be leveraged to overcome some of the corpora limitations of LRLs.
We propose Overlap BPE, a simple yet effective modification to the BPE vocabulary generation algorithm which enhances overlap across related languages.
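As a toy illustration of how a BPE merge loop could be biased toward tokens shared across related languages, the sketch below boosts the score of candidate merges whose merged token appears in more than one language's corpus; the scoring function, hyperparameters, and example words are assumptions and do not reproduce Overlap BPE's actual objective.

```python
# Toy overlap-aware BPE: rank candidate merges by frequency, boosted when the
# pair occurs in several related languages' corpora. Illustrative only.
from collections import Counter

def overlap_bpe_merges(corpora: dict[str, list[str]],
                       num_merges: int = 50, overlap_bonus: float = 2.0):
    # each word is a tuple of characters, counted per language
    vocab = {lang: Counter(tuple(w) for w in words) for lang, words in corpora.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts, pair_langs = Counter(), {}
        for lang, words in vocab.items():
            for word, freq in words.items():
                for a, b in zip(word, word[1:]):
                    pair_counts[(a, b)] += freq
                    pair_langs.setdefault((a, b), set()).add(lang)
        if not pair_counts:
            break
        # score = corpus frequency, boosted when the pair is shared across languages
        best = max(pair_counts,
                   key=lambda p: pair_counts[p] * (overlap_bonus if len(pair_langs[p]) > 1 else 1.0))
        merges.append(best)
        merged = best[0] + best[1]
        for lang, words in vocab.items():
            updated = Counter()
            for word, freq in words.items():
                new, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        new.append(merged)
                        i += 2
                    else:
                        new.append(word[i])
                        i += 1
                updated[tuple(new)] += freq
            vocab[lang] = updated
    return merges

print(overlap_bpe_merges({"hi": ["ghara", "ghar"], "mr": ["ghar", "ghari"]}, num_merges=5))
```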
arXiv Detail & Related papers (2022-03-03T19:35:24Z) - Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study [14.34516262614775]
We argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs.
We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script) and (2) sentence structure.
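One concrete way the script dimension can be exploited is by mapping related Brahmi-derived scripts onto a common script, which their parallel Unicode block layout makes roughly possible by codepoint offset; the sketch below is only a rough illustration of that idea and ignores the script-specific exceptions a real system (or the paper's actual method) would handle.

```python
# Rough sketch: map characters of related Brahmi-derived scripts onto Devanagari
# by Unicode block offset, increasing surface overlap with Hindi text. Real
# transliteration handles many exceptions; this is only illustrative.
DEVANAGARI_BASE = 0x0900
SCRIPT_BASE = {"bengali": 0x0980, "gurmukhi": 0x0A00, "gujarati": 0x0A80, "oriya": 0x0B00}

def to_devanagari(text: str, script: str) -> str:
    base = SCRIPT_BASE[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:              # character lies in the source script's block
            out.append(chr(DEVANAGARI_BASE + (cp - base)))
        else:
            out.append(ch)
    return "".join(out)

print(to_devanagari("বাংলা", "bengali"))          # approximate Devanagari rendering
```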
arXiv Detail & Related papers (2021-06-07T20:43:02Z)