Translate-Distill: Learning Cross-Language Dense Retrieval by
Translation and Distillation
- URL: http://arxiv.org/abs/2401.04810v1
- Date: Tue, 9 Jan 2024 20:40:49 GMT
- Title: Translate-Distill: Learning Cross-Language Dense Retrieval by
Translation and Distillation
- Authors: Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and
Scott Miller
- Abstract summary: This paper proposes Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model.
This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR.
- Score: 17.211592060717713
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Prior work on English monolingual retrieval has shown that a cross-encoder
trained using a large number of relevance judgments for query-document pairs
can be used as a teacher to train more efficient, but similarly effective,
dual-encoder student models. Applying a similar knowledge distillation approach
to training an efficient dual-encoder model for Cross-Language Information
Retrieval (CLIR), where queries and documents are in different languages, is
challenging due to the lack of a sufficiently large training collection when
the query and document languages differ. The state of the art for CLIR thus
relies on translating queries, documents, or both from the large English MS
MARCO training set, an approach called Translate-Train. This paper proposes an
alternative, Translate-Distill, in which knowledge distillation from either a
monolingual cross-encoder or a CLIR cross-encoder is used to train a
dual-encoder CLIR student model. This richer design space enables the teacher
model to perform inference in an optimized setting, while training the student
model directly for CLIR. Trained models and artifacts are publicly available on
Huggingface.
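To make the training recipe concrete, here is a minimal PyTorch-style sketch of cross-encoder-to-dual-encoder distillation of the kind the abstract describes. All names are placeholders, and the KL-divergence soft-label loss is one common choice for this sort of distillation, not necessarily the authors' exact objective.

```python
import torch.nn.functional as F

def distill_step(teacher_scores, student_query_emb, student_doc_embs, temperature=1.0):
    """One distillation training step (minimal sketch, placeholder names).

    teacher_scores:    [num_passages] relevance scores from a monolingual or
                       CLIR cross-encoder teacher, typically precomputed.
    student_query_emb: [dim] dense query embedding from the dual-encoder
                       student, encoded in the CLIR query language.
    student_doc_embs:  [num_passages, dim] dense passage embeddings from the
                       student, encoded in the CLIR document language.
    """
    # Student relevance scores: dot product between the query and each passage.
    student_scores = student_doc_embs @ student_query_emb  # [num_passages]

    # Soft-label distillation: match the student's score distribution over the
    # sampled passages to the teacher's distribution via KL divergence.
    loss = F.kl_div(
        F.log_softmax(student_scores / temperature, dim=-1),
        F.softmax(teacher_scores / temperature, dim=-1),
        reduction="sum",
    )
    return loss
```

In a setup like the one described, the teacher scores would typically be precomputed so that the cross-encoder teacher can run inference in its optimized setting (for example, on English text or high-quality translations), while the dual-encoder student is trained directly on cross-language query-passage pairs.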
Related papers
- Distillation for Multilingual Information Retrieval [10.223578525761617]
The Translate-Distill framework trains a cross-language neural dual-encoder model using translation and distillation.
This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for multilingual information retrieval.
We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train by 5% to 25% in nDCG@20 and 15% to 45% in MAP.
arXiv Detail & Related papers (2024-05-02T03:30:03Z) - Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal
Retrieval [57.98555925471121]
Cross-lingual cross-modal retrieval (CCR) has attracted increasing attention.
Most CCR methods construct pseudo-parallel vision-language corpora via machine translation.
We propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR.
arXiv Detail & Related papers (2023-09-11T13:44:46Z) - Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach significantly improves sentence embedding quality.
arXiv Detail & Related papers (2023-05-16T03:53:30Z) - Transfer Learning Approaches for Building Cross-Language Dense Retrieval
Models [25.150140840908257]
ColBERT-X is a generalization of the ColBERT multi-representation dense retrieval model to support cross-language information retrieval.
In zero-shot training, the system is trained on the English MS MARCO collection, relying on the XLM-R encoder for cross-language mappings.
In translate-train, the system is trained on the MS MARCO English queries coupled with machine translations of the associated MS MARCO passages.
arXiv Detail & Related papers (2022-01-20T22:11:38Z) - Learning Cross-Lingual IR from an English Retriever [10.27108918912692]
The proposed model is far more effective than the existing approach of fine-tuning with cross-lingual labeled IR data, with a gain of 25.4 points in Recall@5kt.
arXiv Detail & Related papers (2021-12-15T15:07:54Z) - Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training benefits encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z) - Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual
Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, peak performance is not achieved by using general-purpose multilingual text encoders off the shelf, but rather by relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language
Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models (see the sketch after this list).
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
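As a companion to the InfoXLM entry above, here is a minimal sketch of a cross-lingual contrastive objective over parallel sentence pairs: each sentence should be closer to its own translation than to the other translations in the batch. The InfoNCE-style formulation and all names below are illustrative assumptions, not the paper's exact pre-training task.

```python
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_embs, tgt_embs, temperature=0.05):
    """InfoNCE-style loss over parallel sentence pairs (illustrative sketch).

    src_embs: [batch, dim] embeddings of sentences in one language.
    tgt_embs: [batch, dim] embeddings of their translations.
    Each sentence should score highest with its own translation and be
    pushed away from the other translations in the batch.
    """
    src = F.normalize(src_embs, dim=-1)
    tgt = F.normalize(tgt_embs, dim=-1)
    logits = src @ tgt.T / temperature                      # [batch, batch] similarities
    labels = torch.arange(src.size(0), device=src.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```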
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.