MuCoT: Multilingual Contrastive Training for Question-Answering in
Low-resource Languages
- URL: http://arxiv.org/abs/2204.05814v1
- Date: Tue, 12 Apr 2022 13:52:54 GMT
- Title: MuCoT: Multilingual Contrastive Training for Question-Answering in
Low-resource Languages
- Authors: Gokul Karthik Kumar, Abhishek Singh Gehlot, Sahal Shaji Mullappilly,
Karthik Nandakumar
- Abstract summary: Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance.
- Score: 4.433842217026879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accuracy of English-language Question Answering (QA) systems has improved
significantly in recent years with the advent of Transformer-based models
(e.g., BERT). These models are pre-trained in a self-supervised fashion with a
large English text corpus and further fine-tuned with a massive English QA
dataset (e.g., SQuAD). However, QA datasets on such a scale are not available
for most of the other languages. Multi-lingual BERT-based models (mBERT) are
often used to transfer knowledge from high-resource languages to low-resource
languages. Since these models are pre-trained with huge text corpora containing
multiple languages, they typically learn language-agnostic embeddings for
tokens from different languages. However, directly training an mBERT-based QA
system for low-resource languages is challenging due to the paucity of training
data. In this work, we augment the QA samples of the target language using
translation and transliteration into other languages and use the augmented data
to fine-tune an mBERT-based QA model, which is already pre-trained in English.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model
with translations from the same language family boosts the question-answering
performance, whereas the performance degrades in the case of cross-language
families. We further show that introducing a contrastive loss between the
translated question-context feature pairs during the fine-tuning process
prevents such degradation for translations across language families and leads
to a marginal improvement. The code for this work is available at
https://github.com/gokulkarthik/mucot.
Related papers
- Zero-shot Cross-lingual Transfer without Parallel Corpus [6.937772043639308]
We propose a novel approach to conduct zero-shot cross-lingual transfer with a pre-trained model.
It consists of a Bilingual Task Fitting module that aligns task-related bilingual information.
A self-training module generates pseudo soft and hard labels for unlabeled data and uses them for self-training.
arXiv Detail & Related papers (2023-10-07T07:54:22Z)
- Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models in a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in a resource-rich language, e.g., English, to other, less resource-rich languages.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- Multilingual Transfer Learning for QA Using Translation as Data Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.
We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance.
Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
arXiv Detail & Related papers (2020-12-10T20:29:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.