Soft Prompt Decoding for Multilingual Dense Retrieval
- URL: http://arxiv.org/abs/2305.09025v1
- Date: Mon, 15 May 2023 21:17:17 GMT
- Title: Soft Prompt Decoding for Multilingual Dense Retrieval
- Authors: Zhiqi Huang, Hansi Zeng, Hamed Zamani and James Allan
- Abstract summary: We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
- Score: 30.766917713997355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we explore a Multilingual Information Retrieval (MLIR) task,
where the collection includes documents in multiple languages. We demonstrate
that applying state-of-the-art approaches developed for cross-lingual
information retrieval to MLIR tasks leads to sub-optimal performance. This is
due to the heterogeneous and imbalanced nature of multilingual collections --
some languages are better represented in the collection and some benefit from
large-scale training data. To address this issue, we present KD-SPD, a novel
soft prompt decoding approach for MLIR that implicitly "translates" the
representation of documents in different languages into the same embedding
space. To address the challenges of data scarcity and imbalance, we introduce a
knowledge distillation strategy. The teacher model is trained on rich English
retrieval data, and by leveraging bi-text data, our distillation framework
transfers its retrieval knowledge to the multilingual document encoder.
Therefore, our approach does not require any multilingual retrieval training
data. Extensive experiments on three MLIR datasets with a total of 15 languages
demonstrate that KD-SPD significantly outperforms competitive baselines in all
cases. We conduct extensive analyses to show that our method has less language
bias and better zero-shot transfer ability towards new languages.
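The abstract gives no implementation detail, but the bi-text distillation idea can be sketched at a high level. Below is a minimal, hypothetical PyTorch sketch assuming embedding-level distillation over parallel (English, non-English) document pairs; the class name, the frozen-teacher setup, and the MSE objective are illustrative assumptions, not the authors' actual KD-SPD training objective.

import torch
import torch.nn.functional as F
from torch import nn

class BiTextDistiller(nn.Module):
    """Distills an English-trained dense retriever into a multilingual document encoder."""

    def __init__(self, teacher: nn.Module, student: nn.Module):
        super().__init__()
        self.teacher = teacher  # English retriever trained on rich English relevance data
        self.student = student  # multilingual document encoder being trained
        for p in self.teacher.parameters():
            p.requires_grad = False  # the teacher stays frozen during distillation

    def forward(self, english_batch: torch.Tensor, translated_batch: torch.Tensor) -> torch.Tensor:
        # english_batch / translated_batch hold pre-tokenized bi-text pairs,
        # i.e., the same documents in English and in another language.
        with torch.no_grad():
            target = self.teacher(english_batch)   # [batch, dim] English embeddings
        pred = self.student(translated_batch)      # [batch, dim] non-English embeddings
        # Pull the non-English representation toward the English one so that
        # documents in different languages land in a shared embedding space.
        return F.mse_loss(pred, target)

Under this kind of setup, no multilingual relevance judgments are needed: only parallel documents and the English-trained teacher, which mirrors the abstract's claim that the approach requires no multilingual retrieval training data.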
Related papers
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models.
We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance.
We see strong cross-lingual performance with English-based retrievers trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
- Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages.
Our experiments reveal that retrieving multilingual documents improves response consistency and decreases geopolitical bias more than using purely in-language documents.
Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual automatic lyrics transcription (ALT) system with the available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z)
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- Multilingual Transfer Learning for QA Using Translation as Data Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.
We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance.
Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
arXiv Detail & Related papers (2020-12-10T20:29:34Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We additionally propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language (a minimal sketch of such a loss appears after this list).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors in transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
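As a concrete illustration of the self-teaching objective mentioned in the FILTER entry above, here is a minimal, hypothetical sketch of a KL-divergence loss over auto-generated soft pseudo-labels; the function name, temperature parameter, and the assumption that pseudo-labels arrive as logits are illustrative choices, not FILTER's actual code.

import torch
import torch.nn.functional as F

def self_teaching_kl_loss(target_lang_logits: torch.Tensor,
                          pseudo_label_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    # target_lang_logits: the model's predictions on the target-language text
    # pseudo_label_logits: auto-generated soft pseudo-labels for that text
    log_p = F.log_softmax(target_lang_logits / temperature, dim=-1)
    with torch.no_grad():
        q = F.softmax(pseudo_label_logits / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # "batchmean" gives the standard distillation-style KL divergence between
    # the soft pseudo-labels and the model's own predictions.
    return F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2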
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.