ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual
Open-retrieval Question Answering System
- URL: http://arxiv.org/abs/2205.14981v1
- Date: Mon, 30 May 2022 10:31:08 GMT
- Authors: Chia-Chien Hung, Tommaso Green, Robert Litschko, Tornike Tsereteli,
Sotaro Takeshita, Marco Bombieri, Goran Glavaš, Simone Paolo Ponzetto
- Abstract summary: This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA).
In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate an answer in the language of the question.
We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation.
- Score: 16.89747171947662
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces our proposed system for the MIA Shared Task on
Cross-lingual Open-retrieval Question Answering (COQA). In this challenging
scenario, given an input question, the system has to gather evidence documents
from a multilingual pool and generate from them an answer in the language of
the question. We devised several approaches combining different model variants
for three main components: Data Augmentation, Passage Retrieval, and Answer
Generation. For passage retrieval, we evaluated the monolingual BM25 ranker
against an ensemble of re-rankers based on multilingual pretrained language
models (PLMs), as well as variants of the shared-task baseline re-trained
from scratch using a recently introduced contrastive loss that maintains a
strong gradient signal throughout training by means of mixed negative samples.
For answer generation, we focused on language- and domain-specialization by
means of continued language model (LM) pretraining of existing multilingual
encoders. Additionally, for both passage retrieval and answer generation, we
augmented the training data provided by the task organizers with automatically
generated question-answer pairs created from Wikipedia passages to mitigate the
issue of data scarcity, particularly for the low-resource languages for which
no training data were provided. Our results show that language- and
domain-specialization as well as data augmentation help, especially for
low-resource languages.
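The monolingual BM25 ranker mentioned in the abstract is a standard lexical baseline. As a rough illustration (not the authors' implementation; tokenization, parameters, and the toy corpus below are assumptions), a minimal Okapi BM25 scorer looks like this:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # term-frequency saturation (k1) and length normalization (b)
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

# Hypothetical toy passage pool and query, for illustration only
corpus = [
    "the capital of france is paris".split(),
    "berlin is the capital of germany".split(),
    "paris is known for the eiffel tower".split(),
]
query = "capital of france".split()
scores = bm25_scores(query, corpus)
ranked = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)
```

In the system described above, such lexical scores would only be a first-stage ranking; the PLM-based re-rankers then re-score the top retrieved passages.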
Related papers
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers.
SAP (summarize-then-ask prompting) assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval [39.061900747689094]
CORA is a Cross-lingual Open-Retrieval Answer Generation model.
It can answer questions across many languages even when language-specific annotated data or knowledge sources are unavailable.
arXiv Detail & Related papers (2021-07-26T06:02:54Z)
- GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval [2.5621280373733604]
We present GermanQuAD, a dataset of 13,722 extractive question/answer pairs.
An extractive QA model trained on GermanQuAD significantly outperforms multilingual models.
arXiv Detail & Related papers (2021-04-26T17:34:31Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data, created from one resource rich language, e.g., English, to other languages, less rich in terms of resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- Multilingual Transfer Learning for QA Using Translation as Data Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.
We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance.
Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
arXiv Detail & Related papers (2020-12-10T20:29:34Z)
- Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering [8.558954185502012]
We propose a method to improve the Cross-lingual Question Answering performance without requiring additional annotated data.
We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it, and PIAF (fr).
arXiv Detail & Related papers (2020-10-23T20:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.