Building a Rich Dataset to Empower the Persian Question Answering Systems
- URL: http://arxiv.org/abs/2412.20212v1
- Date: Sat, 28 Dec 2024 16:53:25 GMT
- Title: Building a Rich Dataset to Empower the Persian Question Answering Systems
- Authors: Mohsen Yazdinejad, Marjan Kaedi
- Abstract summary: The dataset, called NextQuAD, has 7,515 contexts with 23,918 questions and answers.
A BERT-based question answering model has been applied to this dataset using two pre-trained language models.
Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 F1 score.
- Score: 0.6138671548064356
- Abstract: Question answering systems provide short, precise, and specific answers to questions. So far, many robust question answering systems have been developed for English, while some languages with fewer resources, like Persian, have few standard datasets. In this study, a comprehensive open-domain dataset is presented for Persian. This dataset is called NextQuAD and has 7,515 contexts, including 23,918 questions and answers. Then, a BERT-based question answering model is applied to this dataset using two pre-trained language models, ParsBERT and XLM-RoBERTa. The results of these two models are ensembled using mean logits. Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 F1 score. Also, to compare NextQuAD with other Persian datasets, our model trained on NextQuAD is evaluated on two other datasets, PersianQA and ParSQuAD. Comparisons show that the proposed model increased EM by 0.39 and 0.14 on PersianQA and ParSQuAD-manual, respectively, while a slight EM decline of 0.007 occurred on ParSQuAD-automatic.
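The abstract mentions ensembling two models by mean logits and scoring with Exact Match and F1. The following is a minimal sketch of how those steps are commonly done in extractive QA, not the authors' implementation. It assumes both models' start/end logits are already aligned to the same positions (ParsBERT and XLM-RoBERTa use different tokenizers, so some alignment step would be needed in practice), and it uses a simplified SQuAD-style text normalization without any Persian-specific handling. All function names and the `max_len` parameter are hypothetical.

```python
import string
from collections import Counter


def mean_logit_span(start_a, end_a, start_b, end_b, max_len=30):
    """Average two models' start/end logits and return the best (start, end) span.

    Assumption: all four lists are indexed over the same positions.
    """
    start = [(a + b) / 2 for a, b in zip(start_a, start_b)]
    end = [(a + b) / 2 for a, b in zip(end_a, end_b)]
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start):
        # Only consider spans that end at or after their start.
        for j in range(i, min(i + max_len, len(end))):
            if s + end[j] > best_score:
                best, best_score = (i, j), s + end[j]
    return best


def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace (simplified)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Averaging logits before choosing the span lets a confident model outvote an uncertain one, which differs from voting on each model's final answer. Dataset-level EM and F1 are then typically the mean of the per-question scores.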
Related papers
- AmaSQuAD: A Benchmark for Amharic Extractive Question Answering [0.0]
This research presents a novel framework for translating extractive question-answering datasets into low-resource languages.
The methodology addresses challenges related to misalignment between translated questions and answers.
We fine-tune the XLM-R model on the AmaSQuAD synthetic dataset for Amharic Question-Answering.
arXiv Detail & Related papers (2025-02-04T06:27:39Z)
- Datasets for Multilingual Answer Sentence Selection [59.28492975191415]
We introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish).
Results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models.
arXiv Detail & Related papers (2024-06-14T16:50:29Z)
- UQA: Corpus for Urdu Question Answering [3.979019316355144]
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu.
UQA is generated by translating the Stanford Question Answering dataset (SQuAD2.0), a large-scale English QA dataset.
The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T.
arXiv Detail & Related papers (2024-05-02T16:44:31Z)
- Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian [0.0]
We create the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr.
To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset.
Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score.
arXiv Detail & Related papers (2024-04-12T17:27:54Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to the execution results of those parses.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
- Generative Language Models for Paragraph-Level Question Generation [79.31199020420827]
Powerful generative models have led to recent progress in question generation (QG).
It is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches.
We introduce QG-Bench, a benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting.
arXiv Detail & Related papers (2022-10-08T10:24:39Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- PQuAD: A Persian Question Answering Dataset [0.0]
PQuAD is a crowdsourced reading comprehension dataset on Persian Wikipedia articles.
It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable.
Our experiments on different state-of-the-art pre-trained contextualized language models show 74.8% Exact Match (EM) and 87.6% F1-score.
arXiv Detail & Related papers (2022-02-13T05:42:55Z)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers [68.9964449363406]
We extend one of the most popular KGQA benchmarks, QALD-9, by introducing high-quality translations of its questions into 8 languages.
Five of the languages (Armenian, Ukrainian, Lithuanian, Bashkir and Belarusian) had, to the best of our knowledge, never been considered in the KGQA research community before.
arXiv Detail & Related papers (2022-01-31T22:19:55Z)
- PeCoQ: A Dataset for Persian Complex Question Answering over Knowledge Graph [0.0]
This paper introduces PeCoQ, a dataset for Persian question answering.
This dataset contains 10,000 complex questions and answers extracted from the Persian knowledge graph, FarsBase.
There are different types of complexities in the dataset, such as multi-relation, multi-entity, ordinal, and temporal constraints.
arXiv Detail & Related papers (2021-06-27T08:21:23Z)