Learning Cross-Lingual IR from an English Retriever
- URL: http://arxiv.org/abs/2112.08185v1
- Date: Wed, 15 Dec 2021 15:07:54 GMT
- Title: Learning Cross-Lingual IR from an English Retriever
- Authors: Yulong Li, Martin Franz, Md Arafat Sultan, Bhavani Iyer, Young-Suk
Lee, Avirup Sil
- Abstract summary: The proposed model is far more effective than the existing approach of fine-tuning with cross-lingual labeled IR data, with a gain of 25.4 points in Recall@5kt.
- Score: 10.27108918912692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new cross-lingual information retrieval (CLIR) model trained
using multi-stage knowledge distillation (KD). The teacher and the student are
heterogeneous systems: the former is a pipeline that relies on machine
translation and monolingual IR, while the latter executes a single CLIR
operation. We show that the student can learn both multilingual representations
and CLIR by optimizing two corresponding KD objectives. Learning multilingual
representations from an English-only retriever is accomplished using a novel
cross-lingual alignment algorithm that greedily re-positions the teacher tokens
for alignment. Evaluation on the XOR-TyDi benchmark shows that the proposed
model is far more effective than the existing approach of fine-tuning with
cross-lingual labeled IR data, with a gain of 25.4 points in Recall@5kt.
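The token-level distillation objective lends itself to a short sketch. Below is a minimal, hypothetical PyTorch rendering, assuming greedy one-to-one pairing by cosine similarity and an MSE regression onto the frozen teacher; the paper's actual alignment algorithm may differ in its similarity measure and tie-breaking.

```python
# Hypothetical sketch of greedy cross-lingual token alignment for KD;
# the function name and MSE objective are illustrative assumptions.
import torch
import torch.nn.functional as F

def greedy_align_kd_loss(student_reps: torch.Tensor, teacher_reps: torch.Tensor):
    """student_reps: (n, d) tokens of the original-language query;
    teacher_reps: (m, d) tokens of its English translation, with n <= m."""
    assert student_reps.size(0) <= teacher_reps.size(0)
    sim = F.normalize(student_reps, dim=-1) @ F.normalize(teacher_reps, dim=-1).T
    available = torch.ones(teacher_reps.size(0), dtype=torch.bool)
    aligned = []
    for i in range(student_reps.size(0)):
        # Greedily take the most similar teacher token not yet used,
        # effectively re-positioning teacher tokens against the student.
        scores = sim[i].masked_fill(~available, float("-inf"))
        j = int(scores.argmax())
        available[j] = False
        aligned.append(teacher_reps[j])
    target = torch.stack(aligned).detach()  # teacher stays frozen
    return F.mse_loss(student_reps, target)
```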
Related papers
- Translate-Distill: Learning Cross-Language Dense Retrieval by
Translation and Distillation [17.211592060717713]
This paper proposes Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model.
This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR.
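As a rough illustration of cross-encoder-to-dual-encoder distillation (a generic sketch, not the authors' code), the student's dot-product scores over a query's candidate passages can be trained to match the teacher's score distribution:

```python
# Generic cross-encoder -> dual-encoder distillation sketch; the function
# name, temperature, and KL formulation are assumptions for illustration.
import torch
import torch.nn.functional as F

def distill_loss(teacher_scores, query_emb, passage_embs, temperature=1.0):
    """teacher_scores: (k,) cross-encoder scores for k candidate passages;
    query_emb: (d,) and passage_embs: (k, d) come from the student."""
    student_scores = passage_embs @ query_emb  # (k,) dot-product relevance
    t = F.log_softmax(teacher_scores / temperature, dim=-1)
    s = F.log_softmax(student_scores / temperature, dim=-1)
    # KL(teacher || student), computed from log-probabilities.
    return F.kl_div(s, t, log_target=True, reduction="sum")
```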
arXiv Detail & Related papers (2024-01-09T20:40:49Z)
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use the contextualized token representations of one side to reconstruct its translation counterpart.
Our approach can significantly improve sentence embeddings.
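A minimal sketch of an RTL-style objective, under assumed shapes and module names (this is not the DAP authors' implementation): a small decoder must reconstruct the target-side tokens while attending only to the source side's contextualized representations.

```python
# Illustrative RTL-style reconstruction head; all hyperparameters here
# (vocab size, layer count, head count) are assumptions.
import torch
import torch.nn as nn

class RTLHead(nn.Module):
    def __init__(self, hidden=768, vocab=250002, layers=1):  # XLM-R-sized vocab, illustrative
        super().__init__()
        dec_layer = nn.TransformerDecoderLayer(hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, src_token_reps, tgt_emb, tgt_ids):
        # Reconstruct target tokens while attending only to the source
        # side's contextualized representations (passed as memory).
        hidden = self.decoder(tgt_emb, memory=src_token_reps)
        logits = self.lm_head(hidden)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tgt_ids.reshape(-1))
```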
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially
Code-Switched Data [26.38449396649045]
We show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages.
Motivated by this, we propose to train ranking models on artificially code-switched data instead.
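For illustration, artificial code-switching is often implemented as bilingual-dictionary substitution; the toy lexicon and switch rate below are assumptions, not details from the paper:

```python
# Minimal code-switching sketch: each token is swapped for a dictionary
# translation with some probability.
import random

def code_switch(tokens, bilingual_dict, switch_prob=0.5, seed=None):
    rng = random.Random(seed)
    return [
        rng.choice(bilingual_dict[t])
        if t in bilingual_dict and rng.random() < switch_prob else t
        for t in tokens
    ]

# Usage with a toy English->German lexicon.
lexicon = {"cheap": ["billig"], "flights": ["Flüge"], "to": ["nach"]}
print(code_switch("cheap flights to berlin".split(), lexicon, seed=0))
```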
arXiv Detail & Related papers (2023-05-09T09:32:19Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to pull together synonymous tokens, mined via a thesaurus dictionary, and separate them from the other unpaired tokens in a bilingual instance.
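The sequence-level idea can be illustrated with a generic in-batch contrastive (InfoNCE) loss over translation pairs; this is a common formulation and not necessarily VECO 2.0's exact objective:

```python
# In-batch contrastive sketch: each source embedding should be closest to
# its own translation and far from the other (non-parallel) targets.
import torch
import torch.nn.functional as F

def parallel_contrastive_loss(src_embs, tgt_embs, temperature=0.05):
    """src_embs, tgt_embs: (b, d) pooled embeddings of b translation pairs."""
    src = F.normalize(src_embs, dim=-1)
    tgt = F.normalize(tgt_embs, dim=-1)
    logits = src @ tgt.T / temperature   # (b, b) similarity matrix
    labels = torch.arange(src.size(0))   # diagonal entries are parallel pairs
    return F.cross_entropy(logits, labels)
```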
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., for learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- From Good to Best: Two-Stage Training for Cross-lingual Machine Reading
Comprehension [51.953428342923885]
We develop a two-stage approach to enhance model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
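As a rough sketch of the second stage's idea (shapes, names, and temperature are assumptions, not the paper's specification), a contrastive loss can push the gold answer's span representation above the other top-k candidates:

```python
# Illustrative answer-aware contrastive loss over top-k candidate spans.
import torch
import torch.nn.functional as F

def answer_contrastive_loss(candidate_reps, question_rep, gold_index, temperature=0.1):
    """candidate_reps: (k, d) span representations of the top-k candidates;
    question_rep: (d,); gold_index: position of the accurate answer."""
    scores = F.normalize(candidate_reps, dim=-1) @ F.normalize(question_rep, dim=0)
    target = torch.tensor(gold_index)
    # Treat the gold span as the positive class among the k candidates.
    return F.cross_entropy(scores.unsqueeze(0) / temperature, target.unsqueeze(0))
```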
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
- Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual
Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, peak performance is not achieved with general-purpose multilingual text encoders off-the-shelf, but rather with their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
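One way to picture an explicit word-level alignment objective on parallel data (an illustrative formulation, not AMBER's exact one) is to encourage the forward and backward soft attention between the two sides to agree:

```python
# Illustrative word-alignment agreement loss on a parallel sentence pair.
import torch
import torch.nn.functional as F

def word_alignment_loss(src_reps, tgt_reps, temperature=0.1):
    """src_reps: (n, d); tgt_reps: (m, d) contextualized token vectors."""
    sim = src_reps @ tgt_reps.T / temperature
    fwd = F.softmax(sim, dim=1)   # source token -> target distribution
    bwd = F.softmax(sim, dim=0)   # target token -> source distribution
    # Reward agreement between the two soft alignments per source token.
    agreement = (fwd * bwd).sum(dim=1).clamp_min(1e-9)
    return -agreement.log().mean()
```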
arXiv Detail & Related papers (2020-10-15T18:34:13Z)