UQuAD1.0: Development of an Urdu Question Answering Training Data for
Machine Reading Comprehension
- URL: http://arxiv.org/abs/2111.01543v1
- Date: Tue, 2 Nov 2021 12:25:04 GMT
- Title: UQuAD1.0: Development of an Urdu Question Answering Training Data for
Machine Reading Comprehension
- Authors: Samreen Kazi (1), Shakeel Khoja (1) ((1) School of Mathematics &
Computer Science, Institute of Business Administration, Karachi Pakistan)
- Abstract summary: This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0).
In UQuAD1.0, 45,000 QA pairs were generated by machine translation of the original SQuAD1.0 and approximately 4,000 pairs via crowdsourcing.
Using XLM-RoBERTa and multilingual BERT, we obtain F1 scores of 0.66 and 0.63, respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, low-resource Machine Reading Comprehension (MRC) has made
significant progress, with models getting remarkable performance on various
language datasets. However, none of these models have been customized for the
Urdu language. This work explores the semi-automated creation of the Urdu
Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD
with human-generated samples derived from Wikipedia articles and Urdu RC
worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset
intended for extractive machine reading comprehension tasks, consisting of 49k
question-answer pairs in (question, passage, answer) format. In UQuAD1.0,
45,000 QA pairs were generated by machine translation of the original SQuAD1.0
and approximately 4,000 pairs via crowdsourcing. In this study, we used two
types of MRC models: a rule-based baseline and advanced Transformer-based
models. Since the latter clearly outperforms the former, we concentrate solely
on Transformer-based architectures. Using XLM-RoBERTa and multilingual BERT, we
obtain F1 scores of 0.66 and 0.63, respectively.
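The abstract reports SQuAD-style F1 scores for XLM-RoBERTa and multilingual BERT but gives no implementation details. The following is a minimal sketch, not the authors' code, of how a multilingual extractive-QA model could be evaluated on SQuAD-format data such as UQuAD1.0; the checkpoint name and the file "uquad_dev.json" are illustrative assumptions, and the metric below omits the answer normalization (lowercasing, punctuation stripping) applied by the official SQuAD evaluation script.

```python
# Minimal sketch (not the authors' code): evaluating a multilingual extractive-QA
# model on SQuAD-format data with a token-overlap F1. The checkpoint and the
# file name "uquad_dev.json" are illustrative assumptions.
import json
from collections import Counter

from transformers import pipeline


def f1_score(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-overlap F1 (without the official answer normalization)."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Any multilingual extractive-QA checkpoint (e.g. an XLM-RoBERTa or mBERT
# variant fine-tuned on UQuAD-style data) could be plugged in here.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

# UQuAD1.0 follows the SQuAD1.0 (question, passage, answer) JSON layout.
with open("uquad_dev.json", encoding="utf-8") as f:
    articles = json.load(f)["data"]

scores = []
for article in articles:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa_item in paragraph["qas"]:
            pred = qa(question=qa_item["question"], context=context)["answer"]
            gold = [a["text"] for a in qa_item["answers"]]
            scores.append(max(f1_score(pred, g) for g in gold))

print(f"Mean F1: {sum(scores) / len(scores):.2f}")
```

In the paper's setting, XLM-RoBERTa and multilingual BERT would first be fine-tuned on the UQuAD1.0 training split (for example with Hugging Face's standard question-answering fine-tuning utilities) and the resulting checkpoints evaluated as above.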
Related papers
- UQA: Corpus for Urdu Question Answering [3.979019316355144]
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu.
UQA is generated by translating the Stanford Question Answering dataset (SQuAD2.0), a large-scale English QA dataset.
The paper describes the process of selecting and evaluating the best of two candidate translation models: Google Translator and Seamless M4T.
arXiv Detail & Related papers (2024-05-02T16:44:31Z) - MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering [0.4194295877935868]
This research endeavors to bridge the gap left by the absence of efficient QnA datasets in low-resource languages.
We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples.
arXiv Detail & Related papers (2024-04-20T12:16:35Z) - Question answering using deep learning in low resource Indian language
Marathi [0.0]
We investigate different transformer models for creating a reading comprehension-based question answering system.
We achieved the best accuracy with the multilingual MuRIL model, obtaining an EM score of 0.64 and an F1 score of 0.74 by fine-tuning the model on the Marathi dataset.
arXiv Detail & Related papers (2023-09-27T16:53:11Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models under a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z) - Generative Language Models for Paragraph-Level Question Generation [79.31199020420827]
Powerful generative models have led to recent progress in question generation (QG)
It is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches.
We introduce QG-Bench, a benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting.
arXiv Detail & Related papers (2022-10-08T10:24:39Z) - MuCoT: Multilingual Contrastive Training for Question-Answering in
Low-resource Languages [4.433842217026879]
Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance.
arXiv Detail & Related papers (2022-04-12T13:52:54Z) - Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in one resource-rich language, e.g., English, to other languages that are less rich in terms of resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z) - XLM-T: Scaling up Multilingual Machine Translation with Pretrained
Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data.
This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
arXiv Detail & Related papers (2020-12-31T11:16:51Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z) - When in Doubt, Ask: Generating Answerable and Unanswerable Questions,
Unsupervised [0.0]
Question Answering (QA) is key to enabling robust communication between humans and machines.
Modern language models used for QA have surpassed human performance on several essential tasks.
This paper studies augmenting human-made datasets with synthetic data as a way of surmounting this problem.
arXiv Detail & Related papers (2020-10-04T15:56:44Z)