Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially
Code-Switched Data
- URL: http://arxiv.org/abs/2305.05295v2
- Date: Fri, 26 May 2023 13:16:42 GMT
- Title: Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially
Code-Switched Data
- Authors: Robert Litschko, Ekaterina Artemova, Barbara Plank
- Abstract summary: We show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages.
Motivated by this, we propose to train ranking models on artificially code-switched data instead.
- Score: 26.38449396649045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transferring information retrieval (IR) models from a high-resource language
(typically English) to other languages in a zero-shot fashion has become a
widely adopted approach. In this work, we show that the effectiveness of
zero-shot rankers diminishes when queries and documents are present in
different languages. Motivated by this, we propose to train ranking models on
artificially code-switched data instead, which we generate by utilizing
bilingual lexicons. To this end, we experiment with lexicons induced from (1)
cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use
the mMARCO dataset to extensively evaluate reranking models on 36 language
pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual
IR (MLIR). Our results show that code-switching can yield consistent and
substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while
maintaining stable performance in MoIR. Encouragingly, the gains are especially
pronounced for distant languages (up to 2x absolute gain). We further show that
our approach is robust towards the ratio of code-switched tokens and also
extends to unseen languages. Our results demonstrate that training on
code-switched data is a cheap and effective way of generalizing zero-shot
rankers for cross-lingual and multilingual retrieval.
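As a rough illustration of the data-generation step described in the abstract, the sketch below substitutes a fraction of tokens with translations from a bilingual lexicon. The function name, the toy English-German lexicon, and the fixed substitution ratio are illustrative assumptions; the paper induces its lexicons from cross-lingual word embeddings and parallel Wikipedia page titles and applies them to mMARCO training data.
```python
import random

def code_switch(text, lexicon, ratio=0.5, seed=0):
    """Replace a fraction of tokens with their bilingual-lexicon translations.

    Tokens without a lexicon entry are kept in the source language, so the
    output mixes both languages (artificial code-switching).
    """
    rng = random.Random(seed)
    switched = []
    for token in text.split():
        translation = lexicon.get(token.lower())
        if translation is not None and rng.random() < ratio:
            switched.append(translation)  # substitute with the translation
        else:
            switched.append(token)        # keep the original token
    return " ".join(switched)

# Toy English->German lexicon (illustrative only).
lexicon_en_de = {"capital": "Hauptstadt", "france": "Frankreich", "city": "Stadt"}

query = "what is the capital of france"
print(code_switch(query, lexicon_en_de, ratio=1.0))
# -> "what is the Hauptstadt of Frankreich"
```
Lower ratios switch only some lexicon hits, which is how the robustness to the ratio of code-switched tokens mentioned in the abstract would be explored.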
Related papers
- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572]
Cross-lingual transfer is a central task in multilingual NLP.
Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
We propose a simple yet effective method, SALT, to improve zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2023-09-19T19:30:56Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models, and compare the fine-tuned models against the original multilingual LMs.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Learning Disentangled Semantic Representations for Zero-Shot Cross-Lingual Transfer in Multilingual Machine Reading Comprehension [40.38719019711233]
Multilingual pre-trained models are able to zero-shot transfer knowledge from rich-resource languages to low-resource languages in machine reading comprehension (MRC).
In this paper, we propose a novel multilingual MRC framework equipped with a Siamese Semantic Disentanglement Model (SSDM) to disassociate semantics from syntax in representations learned by multilingual pre-trained models.
arXiv Detail & Related papers (2022-04-03T05:26:42Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
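A rough sketch of the unsupervised ad-hoc CLIR setup benchmarked in the entry above: embed the query and the candidate documents with a multilingual encoder and rank by cosine similarity. The specific sentence-transformers checkpoint is an assumption, not necessarily one of the encoders studied in that paper.
```python
# Rank German documents for an English query with a multilingual encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint

query = "what is the capital of france"
docs = [
    "Paris ist die Hauptstadt von Frankreich.",
    "Berlin ist die Hauptstadt von Deutschland.",
]

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(q_emb, d_emb)[0]  # cosine similarity, one score per document

for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```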
- Learning Cross-Lingual IR from an English Retriever [10.27108918912692]
The proposed model is far more effective than the existing approach of fine-tuning with cross-lingual labeled IR data, with a gain of 25.4 points in Recall@5kt.
arXiv Detail & Related papers (2021-12-15T15:07:54Z)
- Multilingual Transfer Learning for QA Using Translation as Data Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.
We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance.
Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
arXiv Detail & Related papers (2020-12-10T20:29:34Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
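A minimal sketch of the distillation idea in the entry above: average the predictive distributions of several per-language teacher models and train a single student to match them. The averaged-teacher KL objective and the temperature are generic assumptions, not necessarily the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """KL divergence between the student and the averaged teacher distributions."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)  # amalgamate the language-branch teachers
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy usage: 3 language-branch teachers, batch of 4 examples, 10 output classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = [torch.randn(4, 10) for _ in range(3)]
loss = multi_teacher_distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```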
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
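For a concrete feel of the cloze-style probing in the entry above, the snippet below queries a multilingual masked LM with hand-written prompts in two languages. The checkpoint and prompts are illustrative assumptions; the actual benchmark covers 23 languages and handles multi-token entities.
```python
# Probe a multilingual masked LM with cloze-style prompts.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")  # assumed checkpoint

prompts = [
    "Paris is the capital of [MASK].",       # English
    "Paris ist die Hauptstadt von [MASK].",  # German
]

for prompt in prompts:
    print(prompt)
    for prediction in fill_mask(prompt, top_k=3):
        print(f"  {prediction['token_str']}  ({prediction['score']:.3f})")
```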
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.