Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
- URL: http://arxiv.org/abs/2510.14274v1
- Date: Thu, 16 Oct 2025 03:48:59 GMT
- Title: Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
- Authors: Lifu Tu, Yingbo Zhou, Semih Yavuz
- Abstract summary: Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. We investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.
- Score: 30.737678658069097
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau, indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.
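The abstract's central finding is that mined hard negatives, rather than sheer training-data scale, drive retrieval gains. As a rough illustration of that ingredient (a minimal PyTorch sketch, not the authors' implementation; the temperature value and the number of hard negatives K are assumptions), an InfoNCE-style loss that contrasts each query against its positive, the other in-batch positives, and K mined hard negatives looks like this:

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(q, d_pos, d_hard, temperature=0.05):
    """InfoNCE-style loss combining in-batch negatives with mined hard negatives.

    q:      (B, H)    query embeddings
    d_pos:  (B, H)    positive passage embeddings
    d_hard: (B, K, H) mined hard-negative passage embeddings per query
    """
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    d_hard = F.normalize(d_hard, dim=-1)

    # In-batch similarity matrix (B, B); the diagonal holds each query's positive.
    in_batch = q @ d_pos.T
    # Similarities to the K mined hard negatives (B, K).
    hard = torch.einsum("bh,bkh->bk", q, d_hard)

    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)  # positive sits on the diagonal
    return F.cross_entropy(logits, labels)
```

In this sketch the hard-negative block is the knob the abstract singles out: adding more training rows mainly enlarges the in-batch term, while mining harder `d_hard` sharpens the contrast the encoder must learn.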
Related papers
- Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation [4.2178072320683375]
We investigate scaling as an efficient strategy for adapting pretrained models to new target languages.
We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained.
Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems.
arXiv Detail & Related papers (2025-12-11T16:09:54Z)
- mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [16.953894896444403]
We introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages.
We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models.
Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models.
arXiv Detail & Related papers (2025-10-01T17:36:59Z)
- Boosting Data Utilization for Multilingual Dense Retrieval [47.16651389111977]
We propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data.
The experimental results on a multilingual retrieval benchmark, MIRACL, with 16 languages demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-09-11T13:42:50Z)
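The entry above hinges on obtaining high-quality hard negatives. One common recipe, sketched below under stated assumptions (dense search with the current encoder over an in-memory matrix, plus a `skip_top` heuristic against likely false negatives), is to retrieve each query's top-ranked passages and keep the highest-scoring ones that are not labeled positive; this illustrates the general idea, not the paper's exact procedure.

```python
import numpy as np

def mine_hard_negatives(query_emb, passage_emb, positives, k=10, skip_top=1):
    """Return, for each query, the ids of the top-scoring passages that are
    not labeled positive, to be used as hard negatives.

    query_emb:   (Q, H) L2-normalized query embeddings
    passage_emb: (P, H) L2-normalized passage embeddings
    positives:   list of sets; positives[i] holds gold passage ids for query i
    skip_top:    skip the very top ranks, a heuristic against false negatives
    """
    scores = query_emb @ passage_emb.T          # (Q, P) cosine similarities
    ranked = np.argsort(-scores, axis=1)        # passage ids, best to worst
    hard_negatives = []
    for i, order in enumerate(ranked):
        negatives = [int(pid) for pid in order[skip_top:] if pid not in positives[i]]
        hard_negatives.append(negatives[:k])
    return hard_negatives
```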
- Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [86.98098988779809]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings.
We train multilingual PRMs on a dataset spanning seven languages, which is translated from English.
Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
- The Impact of Model Scaling on Seen and Unseen Language Performance [2.012425476229879]
We study the performance and scaling behavior of multilingual Large Language Models across 204 languages.
Our findings show significant differences in scaling behavior between zero-shot and two-shot scenarios.
In two-shot settings, larger models show clear linear improvements in multilingual text classification.
arXiv Detail & Related papers (2025-01-10T00:10:21Z)
- A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models [64.79218405438871]
Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models.
We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora.
arXiv Detail & Related papers (2024-06-29T13:12:39Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Improving the Cross-Lingual Generalisation in Visual Question Answering [40.86774711775718]
Multilingual vision-language pretrained models show poor cross-lingual generalisation when applied to non-English data.
In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task.
We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective to augment the cross-entropy loss with a similarity-based loss to guide the model during training, (2) we learn a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without model modification, and (3) we augment training examples using synthetic code-mixing.
arXiv Detail & Related papers (2022-09-07T08:07:43Z)
- EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning [38.928786416891924]
We introduce efficient and effective massively multilingual sentence embedding (EMS) using cross-lingual token-level reconstruction (XTR) and sentence-level contrastive learning as training objectives.
Compared with related studies, the proposed model can be efficiently trained using significantly fewer parallel sentences and GPU computation resources.
We release the codes for model training and the EMS pre-trained sentence embedding model, which supports 62 languages.
arXiv Detail & Related papers (2022-05-31T12:29:25Z)
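EMS pairs sentence-level contrastive learning on parallel pairs with a cross-lingual token-level reconstruction (XTR) objective. The sketch below only illustrates the general shape of such a two-part objective; approximating XTR as multi-hot bag-of-words prediction of the target sentence's tokens, along with the temperature and loss weight, are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualObjectiveHead(nn.Module):
    """Two-part objective on parallel sentence pairs (src, tgt):
    (1) a sentence-level contrastive loss pulling parallel pairs together, and
    (2) a token-level reconstruction term, approximated here as predicting the
        bag of target-side tokens from the source sentence embedding.
    """
    def __init__(self, hidden_size, vocab_size, temperature=0.05, alpha=1.0):
        super().__init__()
        self.bow_head = nn.Linear(hidden_size, vocab_size)
        self.temperature = temperature
        self.alpha = alpha

    def forward(self, src_emb, tgt_emb, tgt_token_ids):
        # (1) Sentence-level contrastive loss with in-batch negatives.
        src = F.normalize(src_emb, dim=-1)
        tgt = F.normalize(tgt_emb, dim=-1)
        logits = src @ tgt.T / self.temperature
        labels = torch.arange(src.size(0), device=src.device)
        contrastive = F.cross_entropy(logits, labels)

        # (2) Token-level reconstruction, approximated as multi-label
        #     bag-of-words prediction of the target sentence's token ids.
        #     (Padding-token masking is omitted here for brevity.)
        bow_logits = self.bow_head(src_emb)                 # (B, V)
        bow_targets = torch.zeros_like(bow_logits)
        bow_targets.scatter_(1, tgt_token_ids, 1.0)         # multi-hot targets
        reconstruction = F.binary_cross_entropy_with_logits(bow_logits, bow_targets)

        return contrastive + self.alpha * reconstruction
```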
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM.
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
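The entry above optimizes a distributionally robust objective with an iterated best response scheme. As a generic, hedged sketch of the underlying min-max idea (not the paper's exact objective, constraint set, or update rule; `model(batch)` returning a scalar loss and the step size `eta` are assumptions), one round alternates a weighted gradient step for the model with an exponentiated-gradient style reweighting toward the worst-performing languages:

```python
import torch

def weighted_training_step(model, optimizer, batches_by_lang, lang_weights, eta=0.1):
    """One round of a simple min-max scheme over language pairs.

    batches_by_lang: dict lang -> batch, where model(batch) returns a scalar loss
    lang_weights:    dict lang -> float, summing to 1
    Returns the updated language weights.
    """
    per_lang_loss = {lang: model(batch) for lang, batch in batches_by_lang.items()}

    # Min step: descend on the weighted sum of per-language losses.
    total = sum(lang_weights[lang] * loss for lang, loss in per_lang_loss.items())
    optimizer.zero_grad()
    total.backward()
    optimizer.step()

    # Max step: multiplicatively boost the weights of languages with higher loss,
    # then renormalize so the weights remain a distribution.
    new_weights = {lang: lang_weights[lang] * torch.exp(eta * loss.detach()).item()
                   for lang, loss in per_lang_loss.items()}
    norm = sum(new_weights.values())
    return {lang: w / norm for lang, w in new_weights.items()}
```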
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)