WikiOmnia: generative QA corpus on the whole Russian Wikipedia
- URL: http://arxiv.org/abs/2204.08009v1
- Date: Sun, 17 Apr 2022 12:59:36 GMT
- Title: WikiOmnia: generative QA corpus on the whole Russian Wikipedia
- Authors: Dina Pisarevskaya, Tatiana Shavrina
- Abstract summary: We present the WikiOmnia dataset, a new publicly available set of QA-pairs and corresponding Russian Wikipedia article summary sections.
The dataset includes every available article from Wikipedia for the Russian language.
The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification.
- Score: 0.2132096006921048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The general QA field has developed its methodology with the Stanford
Question Answering Dataset (SQuAD) as the principal benchmark. However,
compiling factual questions requires time- and labour-intensive annotation,
which limits the potential size of training data. We
present the WikiOmnia dataset, a new publicly available set of QA-pairs and
corresponding Russian Wikipedia article summary sections, composed with a fully
automated generative pipeline. The dataset includes every available article
from Wikipedia for the Russian language. The WikiOmnia pipeline is available
open-source and is also tested for creating SQuAD-formatted QA on other
domains, like news texts, fiction, and social media. The resulting dataset
includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs
with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for
ruT5-large) and cleaned data with strict automatic verification (over 160,000
QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with
paragraphs for ruT5-large).
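The abstract describes a fully automated pipeline: a generative model produces QA pairs from article summaries, and a strict automatic verification step filters them into SQuAD-formatted data. The sketch below is a hedged illustration of that flow; the model call is a stub (the real pipeline prompts ruGPT-3 XL or ruT5-large), and the verification heuristic is an assumption, so only the control flow and output format reflect the paper.

```python
def generate_qa(paragraph):
    """Stub for the generative step.

    In the real pipeline this would prompt ruGPT-3 XL or ruT5-large with the
    paragraph and decode a question and its answer; the heuristic below only
    keeps the sketch runnable.
    """
    answer = paragraph.split(".")[0]
    return "What is described in this article?", answer


def verify(paragraph, question, answer):
    """Strict automatic verification (sketch): keep only pairs whose answer
    literally occurs in the paragraph and whose question is well-formed."""
    return bool(answer) and answer in paragraph and question.endswith("?")


def to_squad(title, paragraph, qa_pairs):
    """Pack the verified pairs into a SQuAD-style entry."""
    return {
        "title": title,
        "paragraphs": [{
            "context": paragraph,
            "qas": [
                {"question": q,
                 "answers": [{"text": a, "answer_start": paragraph.find(a)}]}
                for q, a in qa_pairs
                if verify(paragraph, q, a)
            ],
        }],
    }
```

The SQuAD-style output structure is what makes the raw and cleaned parts of the dataset directly usable by existing reading-comprehension training code.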
Related papers
- KazQAD: Kazakh Open-Domain Question Answering Dataset [2.8158674707210136]
KazQAD is a Kazakh open-domain question answering dataset.
It can be used in reading comprehension and full ODQA settings.
It contains just under 6,000 unique questions with extracted short answers.
arXiv Detail & Related papers (2024-04-06T03:40:36Z)
- IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions [54.23087908182134]
We introduce the first large-scale counterfactual open-domain question-answering (QA) benchmark, named IfQA.
The IfQA dataset contains over 3,800 questions that were annotated by crowdworkers on relevant Wikipedia passages.
The unique challenges posed by the IfQA benchmark will push open-domain QA research on both retrieval and counterfactual reasoning fronts.
arXiv Detail & Related papers (2023-05-23T12:43:19Z)
- LIQUID: A Framework for List Question Answering Dataset Generation [17.86721740779611]
We propose LIQUID, an automated framework for generating list QA datasets from unlabeled corpora.
We first convert a passage from Wikipedia or PubMed into a summary and extract named entities from the summarized text as candidate answers.
We then create questions using an off-the-shelf question generator with the extracted entities and original passage.
Using our synthetic data, we significantly improve the performance of the previous best list QA models by exact-match F1 scores of 5.0 on MultiSpanQA, 1.9 on Quoref, and 2.8 averaged across three BioASQ benchmarks.
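The LIQUID summary describes a three-stage loop: summarize a passage, extract named entities from the summary as candidate answers, and generate questions conditioned on those entities and the original passage. The sketch below mirrors that control flow with hedged stand-ins; the paper uses off-the-shelf summarization, NER, and question-generation models, whereas the heuristics here exist only to keep the example self-contained.

```python
import re


def summarize(passage):
    # Stub: the real framework uses an abstractive summarizer.
    return passage.split(". ")[0] + "."


def extract_entities(text):
    # Stub NER: capitalized tokens stand in for named entities.
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", text)))


def generate_question(entities, passage):
    # Stub for an off-the-shelf question generator conditioned on the
    # candidate answers and the original passage.
    return f"Which entities are mentioned: {', '.join(entities)}?"


def liquid_pipeline(passage):
    """Summarize -> extract candidate answers -> generate a list question."""
    summary = summarize(passage)
    entities = extract_entities(summary)
    return {"question": generate_question(entities, passage),
            "answers": entities,
            "context": passage}
```

The key design choice the summary highlights is that entities are extracted from the *summary* rather than the full passage, which concentrates the candidate answers on salient content.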
arXiv Detail & Related papers (2023-02-03T12:42:45Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
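The WS2T summary describes organizing Wikidata statements as triples and quadruples and mapping them to English Wikipedia sentences. As a hedged illustration only, the sketch below verbalizes a triple through per-property templates; the property IDs shown (P17, P31) are real Wikidata properties, but the template table is a hypothetical stand-in for the paper's corpus-level mapping.

```python
# Hypothetical per-property templates; the paper maps statements to actual
# Wikipedia sentences rather than filling fixed templates.
TEMPLATES = {
    "P31": "{subject} is a {object}.",            # instance of
    "P17": "{subject} is located in {object}.",   # country
}


def verbalize(triple):
    """Map a (subject, property, object) Wikidata triple to a sentence."""
    subject, prop, obj = triple
    template = TEMPLATES.get(prop, "{subject} {prop} {object}.")
    return template.format(subject=subject, object=obj, prop=prop)
```

Even this toy version shows why the paper evaluates sentence structure and noise: unknown properties fall back to an unnatural subject-property-object string.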
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language [0.0]
KenSwQuAD is annotated from raw story texts in Swahili, a low-resource language.
QA datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems.
The researchers engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project.
arXiv Detail & Related papers (2022-05-04T23:53:23Z)
- Relation-Guided Pre-Training for Open-Domain Question Answering [67.86958978322188]
We propose a Relation-Guided Pre-Training (RGPT-QA) framework to solve complex open-domain questions.
We show that RGPT-QA achieves absolute improvements of 2.2%, 2.4%, and 6.3% in Exact Match accuracy on Natural Questions, TriviaQA, and WebQuestions, respectively.
arXiv Detail & Related papers (2021-09-21T17:59:31Z)
- QAConv: Question Answering on Informative Conversations [85.2923607672282]
We focus on informative conversations including business emails, panel discussions, and work channels.
In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions.
arXiv Detail & Related papers (2021-05-14T15:53:05Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- RuBQ: A Russian Dataset for Question Answering over Wikidata [3.394278383312621]
RuBQ is the first Russian knowledge base question answering (KBQA) dataset.
The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, and a Wikidata sample of triples containing entities with Russian labels.
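Each RuBQ entry pairs a question with a reference SPARQL query against Wikidata. As a hedged illustration of what such a query looks like, the helper below builds a query for a country's capital; the Wikidata identifiers used (property P36 "capital", the `wd:`/`wdt:` prefixes, and the label service) are standard Wikidata conventions, but this particular query is a hypothetical example, not one taken from the dataset.

```python
def capital_query(country_qid):
    """Build a Wikidata SPARQL query for the capital (P36) of a country,
    with Russian-first labels via the wikibase label service."""
    return (
        "SELECT ?capital ?capitalLabel WHERE {\n"
        f"  wd:{country_qid} wdt:P36 ?capital .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "ru,en". }\n'
        "}"
    )
```

A KBQA system evaluated on RuBQ would execute such queries against the Wikidata endpoint (or the dataset's bundled triple sample) and compare the results to the reference answers.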
arXiv Detail & Related papers (2020-05-21T14:06:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.