RuBQ: A Russian Dataset for Question Answering over Wikidata
- URL: http://arxiv.org/abs/2005.10659v1
- Date: Thu, 21 May 2020 14:06:15 GMT
- Title: RuBQ: A Russian Dataset for Question Answering over Wikidata
- Authors: Vladislav Korablinov and Pavel Braslavski
- Abstract summary: RuBQ is the first Russian knowledge base question answering (KBQA) dataset.
The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, and a Wikidata sample of triples containing entities with Russian labels.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The paper presents RuBQ, the first Russian knowledge base question answering
(KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of
varying complexity, their English machine translations, SPARQL queries to
Wikidata, reference answers, as well as a Wikidata sample of triples containing
entities with Russian labels. The dataset creation started with a large
collection of question-answer pairs from online quizzes. The data underwent
automatic filtering, crowd-assisted entity linking, automatic generation of
SPARQL queries, and their subsequent in-house verification.
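To make the dataset format concrete, a minimal sketch of the kind of question-to-SPARQL pair RuBQ pairs up (the question, entity ID, and property ID here are illustrative examples, not taken from the dataset itself):

```sparql
# Illustrative question: "Где родился Дуглас Адамс?"
# ("Where was Douglas Adams born?")
SELECT ?answer WHERE {
  wd:Q42 wdt:P19 ?answer .  # Douglas Adams (Q42) -> place of birth (P19)
}
```

Executed against the Wikidata endpoint, such a query returns the entity that serves as the reference answer; RuBQ stores the question, the query, and the answer together.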
Related papers
- Integrating SPARQL and LLMs for Question Answering over Scholarly Data Sources
This paper describes a methodology that combines SPARQL queries, divide-and-conquer algorithms, and predictions from a BERT model fine-tuned on SQuAD 2.0.
The approach, evaluated with Exact Match and F-score metrics, shows promise for improving QA accuracy and efficiency in scholarly contexts.
arXiv Detail & Related papers (2024-09-11T14:50:28Z)
- SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions
We introduce the SPINACH dataset, an expert-annotated KBQA dataset collected from discussions on Wikidata's "Request a Query" forum.
The complexity of these in-the-wild queries calls for a KBQA system that can dynamically explore large and often incomplete schemas and reason about them.
We also introduce an in-context learning KBQA agent, also called SPINACH, that mimics how a human expert would write SPARQL queries to handle challenging questions.
arXiv Detail & Related papers (2024-07-16T06:18:21Z)
- NewsQs: Multi-Source Question Generation for the Inquiring Mind
We present NewsQs, a dataset that provides question-answer pairs for multiple news documents.
To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles.
arXiv Detail & Related papers (2024-02-28T16:59:35Z)
- KGConv, a Conversational Corpus grounded in Wikidata
KGConv is a large conversational corpus of 71k conversations, each grounded in a Wikidata fact.
We provide multiple variants (12 on average) of the corresponding question using templates, human annotations, hand-crafted rules and a question rewriting neural model.
KGConv can further be used for other generation and analysis tasks such as single-turn question generation from Wikidata triples, question rewriting, question answering from conversation or from knowledge graphs and quiz generation.
arXiv Detail & Related papers (2023-08-29T13:35:51Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Towards Complex Document Understanding By Discrete Reasoning
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- WikiOmnia: generative QA corpus on the whole Russian Wikipedia
We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections.
The dataset includes every available article from Wikipedia for the Russian language.
The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification.
arXiv Detail & Related papers (2022-04-17T12:59:36Z)
- A Chinese Multi-type Complex Questions Answering Dataset over Wikidata
Complex knowledge base question answering has been a popular research area over the past decade.
Recent public datasets have led to encouraging results in this field, but are mostly limited to English.
Few state-of-the-art KBQA models are trained on Wikidata, one of the most popular real-world knowledge bases.
We propose CLC-QuAD, the first large-scale complex Chinese semantic parsing dataset over Wikidata, to address these challenges.
arXiv Detail & Related papers (2021-11-11T07:39:16Z)
- ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers
We describe a Question Answering dataset that contains complex questions with conditional answers.
We call this dataset ConditionalQA.
We show that ConditionalQA is challenging for many of the existing QA models, especially in selecting answer conditions.
arXiv Detail & Related papers (2021-10-13T17:16:46Z)
- Open Question Answering over Tables and Text
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.