Transfer Learning Approaches for Building Cross-Language Dense Retrieval
Models
- URL: http://arxiv.org/abs/2201.08471v1
- Date: Thu, 20 Jan 2022 22:11:38 GMT
- Title: Transfer Learning Approaches for Building Cross-Language Dense Retrieval
Models
- Authors: Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin Duh, Paul McNamee, Kenton
Murray, James Mayfield, Douglas W. Oard
- Abstract summary: ColBERT-X is a generalization of the ColBERT multi-representation dense retrieval model to support cross-language information retrieval.
In zero-shot training, the system is trained on the English MS MARCO collection, relying on the XLM-R encoder for cross-language mappings.
In translate-train, the system is trained on the MS MARCO English queries coupled with machine translations of the associated MS MARCO passages.
- Score: 25.150140840908257
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The advent of transformer-based models such as BERT has led to the rise of
neural ranking models. These models have improved the effectiveness of
retrieval systems well beyond that of lexical term matching models such as
BM25. While monolingual retrieval tasks have benefited from large-scale
training collections such as MS MARCO and advances in neural architectures,
cross-language retrieval tasks have fallen behind these advancements. This
paper introduces ColBERT-X, a generalization of the ColBERT
multi-representation dense retrieval model that uses the XLM-RoBERTa (XLM-R)
encoder to support cross-language information retrieval (CLIR). ColBERT-X can
be trained in two ways. In zero-shot training, the system is trained on the
English MS MARCO collection, relying on the XLM-R encoder for cross-language
mappings. In translate-train, the system is trained on the MS MARCO English
queries coupled with machine translations of the associated MS MARCO passages.
Results on ad hoc document ranking tasks in several languages demonstrate
substantial and statistically significant improvements of these trained dense
retrieval models over traditional lexical CLIR baselines.
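To make the retrieval mechanism concrete, below is a minimal sketch of ColBERT-style "late interaction" (MaxSim) scoring on top of an XLM-R encoder, as the abstract describes. It is an illustrative assumption rather than the authors' released configuration: the `xlm-roberta-base` checkpoint, the 128-dimensional projection, and the scoring details are placeholders, and in ColBERT-X the encoder and projection would be fine-tuned on MS MARCO (zero-shot) or on machine-translated MS MARCO passages (translate-train).

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Sketch only: untrained projection, assumed checkpoint and dimensions.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
proj = torch.nn.Linear(encoder.config.hidden_size, 128, bias=False)  # assumed dim

def embed(texts, max_length=128):
    """Encode texts into L2-normalized per-token embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # [B, T, H]
        reps = F.normalize(proj(hidden), dim=-1)         # [B, T, 128]
    return reps, batch["attention_mask"]

def maxsim_score(query, passage):
    """Late interaction: each query token takes its maximum similarity over
    the passage tokens, and the per-token maxima are summed."""
    q, q_mask = embed([query])
    d, d_mask = embed([passage])
    sim = q[0] @ d[0].T                                        # [Tq, Td] similarities
    sim = sim.masked_fill(d_mask[0].eq(0).unsqueeze(0), -1e4)  # ignore passage padding
    per_token = sim.max(dim=1).values * q_mask[0]              # ignore query padding
    return per_token.sum().item()

# Zero-shot CLIR illustration: an English query scored against a non-English passage.
print(maxsim_score("neural ranking models",
                   "Los modelos neuronales de ranking mejoran la recuperación."))
```

In the translate-train condition the same architecture and scoring are kept; only the training data changes, pairing English MS MARCO queries with machine-translated passages.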
Related papers
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot
Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z)
- Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation [17.211592060717713]
This paper proposes Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model.
This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR. (A minimal sketch of such a distillation objective appears after this list.)
arXiv Detail & Related papers (2024-01-09T20:40:49Z)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers.
SAP (summarize-then-ask prompting) assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
- Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media [40.26888469822391]
Claim span identification (CSI) is an important step in fact-checking pipelines.
Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem.
We create a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English.
arXiv Detail & Related papers (2023-10-27T15:28:12Z)
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Multilingual ColBERT-X [11.768656900939048]
ColBERT-X is a dense retrieval model for Cross-Language Information Retrieval (CLIR).
In CLIR, documents are written in one natural language, while the queries are expressed in another.
A related task is multilingual IR (MLIR) where the system creates a single ranked list of documents written in many languages.
arXiv Detail & Related papers (2022-09-03T06:02:52Z)
- Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM.
To handle the symbol order and sequence length differences between languages, we propose an intermediate "TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
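As referenced in the Translate-Distill entry above, a cross-encoder teacher's scores over a query's candidate passages can be distilled into a dual-encoder CLIR student such as ColBERT-X. The sketch below shows one common way to write such an objective (softmax over candidates plus KL divergence); the temperature and the exact loss form are assumptions for illustration, not necessarily the formulation used in that paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """KL divergence between the teacher's and the student's softmax
    distributions over each query's candidate passages.

    student_scores, teacher_scores: [num_queries, num_passages] raw scores.
    Temperature and the KL formulation are assumed for illustration.
    """
    s_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    t_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean")

# Toy usage: 2 queries with 4 candidate passages each. The teacher scores would
# come from a (monolingual or CLIR) cross-encoder; the student scores from the
# dual-encoder being trained.
teacher_scores = torch.tensor([[9.1, 2.3, 0.5, -1.0],
                               [4.0, 3.8, 1.2, 0.1]])
student_scores = torch.randn(2, 4, requires_grad=True)
loss = distillation_loss(student_scores, teacher_scores)
loss.backward()  # gradients update only the student
```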