Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages
- URL: http://arxiv.org/abs/2507.21568v2
- Date: Thu, 31 Jul 2025 08:13:22 GMT
- Title: Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages
- Authors: Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena,
- Abstract summary: We argue that the teacher model's output distribution holds valuable insights for the student.<n>We present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence.
- Score: 2.2061683015812026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model's output distribution holds valuable insights for the student, beyond the approximated mode obtained through beam search (the standard decoding method), and present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence. This provides a larger representation of the teacher model distribution and exposes the student model to a wider range of target-side prefixes. We leverage $n$-best lists from beam search to guide the student's learning and examine alternative decoding methods to address issues like low variability and the under-representation of infrequent tokens. For low-resource languages, our research shows that while sampling methods may slightly compromise translation quality compared to beam search based approaches, they enhance the generated corpora with greater variability and lexical richness. This ultimately improves student model performance and mitigates the gender bias amplification often associated with KD.
Related papers
- Multiple Choice Learning of Low Rank Adapters for Language Modeling [40.380297530862656]
We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time.<n>We demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
arXiv Detail & Related papers (2025-07-14T16:00:51Z) - Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment [10.104085497265004]
We propose Ranking Loss based Knowledge Distillation (RLKD), which encourages consistency of peak predictions between the teacher and student models.<n>Our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
arXiv Detail & Related papers (2024-09-19T08:06:42Z) - Extrapolating Multilingual Understanding Models as Multilingual
Generators [82.1355802012414]
This paper explores methods to empower multilingual understanding models the generation abilities to get a unified model.
We propose a textbfSemantic-textbfGuided textbfAlignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual multilingual language model called masked sentence model (MSM)
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - A Cohesive Distillation Architecture for Neural Language Models [0.0]
A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size.
This study investigates methods for Knowledge Distillation (KD) to provide efficient alternatives to large-scale models.
arXiv Detail & Related papers (2023-01-12T08:01:53Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Multi-Level Knowledge Distillation for Out-of-Distribution Detection in
Text [12.428289757859433]
Self-supervised representation learning has proved to be a valuable component for out-of-distribution (OoD) detection.
In this paper, we analyze the complementary characteristics of both OoD detection methods.
We propose a multi-level knowledge distillation approach that integrates their strengths while mitigating their limitations.
arXiv Detail & Related papers (2022-11-21T09:41:25Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models proficient in individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Pre-training Multilingual Neural Machine Translation by Leveraging
Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse setting, including low, medium, rich resource, and as well as transferring to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.