Multilingual ColBERT-X
- URL: http://arxiv.org/abs/2209.01335v1
- Date: Sat, 3 Sep 2022 06:02:52 GMT
- Title: Multilingual ColBERT-X
- Authors: Dawn Lawrie and Eugene Yang and Douglas W. Oard and James Mayfield
- Abstract summary: ColBERT-X is a dense retrieval model for Cross Language Information Retrieval (CLIR).
In CLIR, documents are written in one natural language, while the queries are expressed in another.
A related task is multilingual IR (MLIR) where the system creates a single ranked list of documents written in many languages.
- Score: 11.768656900939048
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: ColBERT-X is a dense retrieval model for Cross Language Information Retrieval
(CLIR). In CLIR, documents are written in one natural language, while the
queries are expressed in another. A related task is multilingual IR (MLIR)
where the system creates a single ranked list of documents written in many
languages. Given that ColBERT-X relies on a pretrained multilingual neural
language model to rank documents, a multilingual training procedure can enable
a version of ColBERT-X well-suited for MLIR. This paper describes that training
procedure. An important factor for good MLIR ranking is fine-tuning XLM-R using
mixed-language batches, where the same query is matched with documents in
different languages in the same batch. Neural machine translations of MS MARCO
passages are used to fine-tune the model.
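
To make the mixed-language batching idea concrete, here is a minimal sketch that assembles training batches in which each English MS MARCO query is paired with machine-translated passages in rotating document languages. The data layout, language codes, and function names are illustrative assumptions, not the authors' released code.

```python
import random

# Assumed data layout (illustrative): for each English MS MARCO query we have a
# relevant (positive) passage and a sampled negative passage, each machine-
# translated into every document language of the multilingual collection.
LANGS = ["zho", "fas", "rus"]  # placeholder document languages

def mixed_language_batches(triples, langs=LANGS, batch_size=32, seed=0):
    """Yield batches in which the same English query is paired with passages
    in different languages within a single batch.

    `triples` is a list of dicts:
        {"query": str,
         "pos": {lang: translated positive passage},
         "neg": {lang: translated negative passage}}
    """
    rng = random.Random(seed)
    batch = []
    for t in triples:
        # Rotate over languages so every batch mixes languages instead of
        # grouping all examples of one language together.
        lang = langs[len(batch) % len(langs)]
        batch.append((t["query"], t["pos"][lang], t["neg"][lang]))
        if len(batch) == batch_size:
            rng.shuffle(batch)
            yield batch
            batch = []
    if batch:
        yield batch
```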
Related papers
- Distillation for Multilingual Information Retrieval [10.223578525761617]
The Translate-Distill framework trains a cross-language neural dual-encoder model using translation and distillation.
This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for multilingual information retrieval.
We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train by 5% to 25% in nDCG@20 and 15% to 45% in MAP.
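
As a rough sketch of the distillation component of such training (an assumption about the general recipe, not the MTD implementation): a cross-encoder teacher scores each query's English candidate passages, and the multilingual student is trained so that its score distribution over the translated passages matches the teacher's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student score distributions over each
    query's candidate passages.

    student_scores, teacher_scores: [num_queries, num_passages]; the student
    scores translated passages, the teacher scores the English originals.
    """
    student_logp = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_p = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")

# Toy usage: 4 queries, 8 candidate passages each.
student = torch.randn(4, 8, requires_grad=True)
teacher = torch.randn(4, 8)
loss = distillation_loss(student, teacher)
loss.backward()
```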
arXiv Detail & Related papers (2024-05-02T03:30:03Z)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset covering 33 languages, for fine-tuning multilingual dense retrievers.
Summarize-then-ask prompting (SAP) assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
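
A minimal sketch of LLM-based synthetic query generation in a target language; the summarize-then-ask style prompt template and the `call_llm` wrapper are placeholders for illustration, not the SWIM-IR/SAP pipeline.

```python
# Illustrative only: `call_llm` stands in for whatever LLM API is used, and the
# prompt is a plausible summarize-then-ask style template, not the exact prompt.
def make_query_generation_prompt(passage: str, target_language: str) -> str:
    return (
        "Passage:\n"
        f"{passage}\n\n"
        "Step 1: Summarize the key facts of the passage in one sentence.\n"
        f"Step 2: Write one question in {target_language} that an information "
        "seeker could ask and that this passage answers.\n"
        "Return only the question."
    )

def generate_synthetic_query(passage: str, target_language: str, call_llm) -> str:
    """call_llm: a user-supplied function str -> str wrapping an LLM call."""
    return call_llm(make_query_generation_prompt(passage, target_language)).strip()
```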
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
- Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
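
For intuition, here is a generic soft-prompting wrapper in which learnable prompt vectors are prepended to a document's token embeddings before encoding; it illustrates the general mechanism only and is not the KD-SPD decoder.

```python
import torch
import torch.nn as nn

class SoftPromptedEncoder(nn.Module):
    """Prepend learnable soft-prompt embeddings to a document's token
    embeddings before running an encoder that accepts input embeddings.
    Generic illustration of soft prompting, not the KD-SPD architecture."""

    def __init__(self, encoder: nn.Module, hidden_size: int, prompt_len: int = 16):
        super().__init__()
        self.encoder = encoder  # any module that maps [batch, seq, dim] -> output
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden_size) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: [batch, seq_len, hidden_size]
        batch = token_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, token_embeds], dim=1))

# Toy usage with an identity "encoder" and random token embeddings.
enc = SoftPromptedEncoder(nn.Identity(), hidden_size=768, prompt_len=8)
out = enc(torch.randn(2, 40, 768))   # -> [2, 48, 768]
```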
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
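
A hedged sketch of a masked-sentence contrastive objective of the kind described: the document encoder's prediction at a masked slot is scored against the true sentence vector and sampled negatives. Shapes and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_sentence_contrastive_loss(predicted: torch.Tensor,
                                     target: torch.Tensor,
                                     negatives: torch.Tensor,
                                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss for predicting a masked sentence vector.

    predicted: [batch, dim]            prediction at the masked position
    target:    [batch, dim]            true vector of the masked sentence
    negatives: [batch, num_neg, dim]   sampled negative sentence vectors
    """
    predicted = F.normalize(predicted, dim=-1)
    target = F.normalize(target, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos = (predicted * target).sum(-1, keepdim=True)          # [batch, 1]
    neg = torch.einsum("bd,bnd->bn", predicted, negatives)    # [batch, num_neg]
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)    # positive is index 0
    return F.cross_entropy(logits, labels)

# Toy usage: batch of 8, 256-dim vectors, 15 negatives per example.
loss = masked_sentence_contrastive_loss(
    torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 15, 256))
```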
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models [25.150140840908257]
ColBERT-X is a generalization of the ColBERT multi-representation dense retrieval model to support cross-language information retrieval.
In zero-shot training, the system is trained on the English MS MARCO collection, relying on the XLM-R encoder for cross-language mappings.
In translate-train, the system is trained on the MS MARCO English queries coupled with machine translations of the associated MS MARCO passages.
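
ColBERT-X keeps ColBERT's late-interaction ("MaxSim") scoring while swapping in a multilingual encoder such as XLM-R. A minimal version of that scoring over precomputed token embeddings looks roughly like this (shapes are illustrative):

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction relevance score.

    query_embs: [num_q_tokens, dim]  (L2-normalized token embeddings)
    doc_embs:   [num_d_tokens, dim]
    Returns the sum over query tokens of each token's maximum cosine
    similarity against any document token.
    """
    sim = query_embs @ doc_embs.T          # [num_q_tokens, num_d_tokens]
    return sim.max(dim=1).values.sum()

# Toy usage with random, normalized embeddings.
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
score = maxsim_score(q, d)
```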
arXiv Detail & Related papers (2022-01-20T22:11:38Z)
- Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
- MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first publicly available multilingual FAQ dataset.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z)
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding [34.42574051786547]
Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks.
We present a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding.
arXiv Detail & Related papers (2021-04-18T12:16:00Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We additionally propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
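
As a rough illustration of combining supervised training on labeled source-language input with a KL self-teaching term on target-language input (an approximation of the idea described above, not FILTER's exact objective):

```python
import torch
import torch.nn.functional as F

def filter_style_losses(source_logits: torch.Tensor,
                        source_labels: torch.Tensor,
                        target_logits: torch.Tensor,
                        temperature: float = 1.0):
    """Illustrative two-part objective:
    - cross-entropy on the labeled source-language input;
    - KL self-teaching on the target-language input, using soft pseudo-labels
      derived from the model's own source-side predictions.
    source_logits, target_logits: [batch, num_classes]; source_labels: [batch].
    """
    ce = F.cross_entropy(source_logits, source_labels)
    pseudo = F.softmax(source_logits.detach() / temperature, dim=-1)
    kl = F.kl_div(F.log_softmax(target_logits / temperature, dim=-1),
                  pseudo, reduction="batchmean")
    return ce, kl

# Toy usage: 16 examples, 3 classes.
ce, kl = filter_style_losses(torch.randn(16, 3),
                             torch.randint(0, 3, (16,)),
                             torch.randn(16, 3))
total_loss = ce + kl
```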
arXiv Detail & Related papers (2020-09-10T22:42:15Z)