MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering
- URL: http://arxiv.org/abs/2404.13364v1
- Date: Sat, 20 Apr 2024 12:16:35 GMT
- Title: MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering
- Authors: Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Sharvi Endait, Raviraj Joshi,
- Abstract summary: This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages.
We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples.
- Score: 0.4194295877935868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages by translating the English Question Answering Dataset (SQuAD) using a robust data curation approach. We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. We also present a gold test set of manually verified 500 examples. Challenges in maintaining context and handling linguistic nuances are addressed, ensuring accurate translations. Moreover, as a QnA dataset cannot be simply converted into any low-resource language using translation, we need a robust method to map the answer translation to its span in the translated passage. Hence, to address this challenge, we also present a generic approach for translating SQuAD into any low-resource language. Thus, we offer a scalable approach to bridge linguistic and cultural gaps present in low-resource languages, in the realm of question-answering systems. The datasets and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP .
Related papers
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [26.13077589552484]
Indic-QA is the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families.
We generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance.
We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages.
arXiv Detail & Related papers (2024-07-18T13:57:16Z) - MST5 -- Multilingual Question Answering over Knowledge Graphs [1.6470999044938401]
Knowledge Graph Question Answering (KGQA) simplifies querying vast amounts of knowledge stored in a graph-based model using natural language.
Existing multilingual KGQA systems face challenges in achieving performance comparable to English systems.
We propose a simplified approach to enhance multilingual KGQA systems by incorporating linguistic context and entity information directly into the processing pipeline of a language model.
arXiv Detail & Related papers (2024-07-08T15:37:51Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA)
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - Breaking Language Barriers: A Question Answering Dataset for Hindi and
Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Cross-Lingual Question Answering over Knowledge Base as Reading
Comprehension [61.079852289005025]
Cross-lingual question answering over knowledge base (xKBQA) aims to answer questions in languages different from that of the provided knowledge base.
One of the major challenges facing xKBQA is the high cost of data annotation.
We propose a novel approach for xKBQA in a reading comprehension paradigm.
arXiv Detail & Related papers (2023-02-26T05:52:52Z) - QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models under a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - MuCoT: Multilingual Contrastive Training for Question-Answering in
Low-resource Languages [4.433842217026879]
Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance.
arXiv Detail & Related papers (2022-04-12T13:52:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.