The Wikidata Query Logs Dataset
- URL: http://arxiv.org/abs/2602.14594v1
- Date: Mon, 16 Feb 2026 09:49:44 GMT
- Title: The Wikidata Query Logs Dataset
- Authors: Sebastian Walter, Hannah Bast
- Abstract summary: We present the Wikidata Query Logs dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. We present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions.
- Score: 2.9907607782169543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.
Related papers
- Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning [51.203811759364925]
mKGQAgent breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the participants.
arXiv Detail & Related papers (2025-07-22T19:23:03Z)
- Database-Augmented Query Representation for Information Retrieval [71.41745087624528]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu). DAQu augments the original query with various (query-related) metadata across multiple tables. We validate DAQu in diverse retrieval scenarios, demonstrating that it significantly enhances overall retrieval performance.
arXiv Detail & Related papers (2024-06-23T05:02:21Z)
- Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata [6.716263690738313]
This paper presents WikiWebQuestions, a high-quality question answering benchmark for Wikidata.
It consists of real-world data with SPARQL.
We modify SPARQL to use the unique domain and property names instead of their IDs.
arXiv Detail & Related papers (2023-05-23T16:20:43Z)
- Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to their execution results.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
- Improving Candidate Retrieval with Entity Profile Generation for Wikidata Entity Linking [76.00737707718795]
We propose a novel candidate retrieval paradigm based on entity profiling.
We use the profile to query the indexed search engine to retrieve candidate entities.
Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary.
arXiv Detail & Related papers (2022-02-27T17:38:53Z)
- A Chinese Multi-type Complex Questions Answering Dataset over Wikidata [45.31495982252219]
Complex Knowledge Base Question Answering has been a popular area of research in the past decade.
Recent public datasets have led to encouraging results in this field, but are mostly limited to English.
Few state-of-the-art KBQA models are trained on Wikidata, one of the most popular real-world knowledge bases.
We propose CLC-QuAD, the first large-scale complex Chinese semantic parsing dataset over Wikidata, to address these challenges.
arXiv Detail & Related papers (2021-11-11T07:39:16Z)
- SPARQLing Database Queries from Intermediate Question Decompositions [7.475027071883912]
To translate natural language questions into database queries, most approaches rely on a fully annotated training set.
We reduce this burden using intermediate question representations that are grounded in databases.
Our pipeline consists of two parts: a semantic parser that converts natural language questions into the intermediate representations and a non-trainable transpiler to the SPARQL query language.
arXiv Detail & Related papers (2021-09-13T17:57:12Z)
- Creating and Querying Personalized Versions of Wikidata on a Laptop [0.7449724123186383]
This paper introduces KGTK Kypher, a query language and processor that allows users to create personalized variants of Wikidata on a laptop.
We present several use cases that illustrate the types of analyses that Kypher enables users to run on the full Wikidata KG on a laptop.
arXiv Detail & Related papers (2021-08-06T00:00:33Z)
- Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering [78.9863753810787]
A large amount of the world's knowledge is stored in structured databases.
Query languages can answer questions that require complex reasoning, as well as offering full explainability.
arXiv Detail & Related papers (2021-08-05T22:04:13Z)
- Wikidata on MARS [0.20305676256390934]
Multi-attributed relational structures (MARSs) have been proposed as a formal data model for generalized property graphs.
MARPL is a useful rule-based logic in which to write inference rules over property graphs.
Wikidata can be modelled in an extended MARS that adds the (imprecise) datatypes of Wikidata.
arXiv Detail & Related papers (2020-08-14T22:58:04Z)
- RuBQ: A Russian Dataset for Question Answering over Wikidata [3.394278383312621]
RuBQ is the first Russian knowledge base question answering (KBQA) dataset.
The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, and a Wikidata sample of triples containing entities with Russian labels.
arXiv Detail & Related papers (2020-05-21T14:06:15Z)