Wikidata as a seed for Web Extraction
- URL: http://arxiv.org/abs/2401.07812v1
- Date: Mon, 15 Jan 2024 16:35:52 GMT
- Title: Wikidata as a seed for Web Extraction
- Authors: Kunpeng Guo, Dennis Diefenbach, Antoine Gourru, Christophe Gravier
- Abstract summary: We present a framework that identifies and extracts new facts published across multiple Web domains.
We take inspiration from ideas that are used to extract facts from textual collections and adapt them to extract facts from Web pages.
Our experiments show that we achieve a mean F1-score of 84.07.
- Score: 4.273966905160028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wikidata has grown to a knowledge graph with an impressive size. To date, it
contains more than 17 billion triples collecting information about people,
places, films, stars, publications, proteins, and many more. On the other side,
most of the information on the Web is not published in highly structured data
repositories like Wikidata, but rather as unstructured and semi-structured
content, more concretely in HTML pages containing text and tables. Finding,
monitoring, and organizing this data in a knowledge graph requires
considerable work from human editors. The volume and complexity of the data
make this task difficult and time-consuming. In this work, we present a
framework that identifies and extracts new facts published across multiple
Web domains so that they can be proposed for validation by
Wikidata editors. The framework relies on question-answering technologies.
We take inspiration from ideas that are used to extract facts from textual
collections and adapt them to extract facts from Web pages. To achieve this,
we demonstrate that language models can be adapted to extract facts not only
from textual collections but also from Web pages. By exploiting the information
already contained in Wikidata, the proposed framework can be trained without the
need for any additional learning signals and can extract new facts for a wide
range of properties and domains. Following this path, Wikidata can be used as a
seed to extract facts on the Web. Our experiments show that we can achieve a
mean F1-score of 84.07. Moreover, our estimates show that we
can potentially extract millions of facts that can be proposed for human
validation. The goal is to help editors in their daily tasks and contribute to
the completion of the Wikidata knowledge graph.
Related papers
- Leveraging Wikidata's edit history in knowledge graph refinement tasks [77.34726150561087]
The edit history represents the process by which the community reaches a kind of fuzzy, distributed consensus.
We build a dataset containing the edit history of every instance from the 100 most important classes in Wikidata.
We propose and evaluate two new methods to leverage this edit history information in knowledge graph embedding models for type prediction tasks.
arXiv Detail & Related papers (2022-10-27T14:32:45Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose a mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Enriching Wikidata with Linked Open Data [4.311189028205597]
Current linked open data (LOD) tools are not suitable for enriching large graphs like Wikidata.
We present a novel workflow that includes gap detection, source selection, schema alignment, and semantic validation.
Our experiments show that our workflow can enrich Wikidata with millions of high-quality novel statements from external LOD sources.
arXiv Detail & Related papers (2022-07-01T01:50:24Z)
- Improving Candidate Retrieval with Entity Profile Generation for Wikidata Entity Linking [76.00737707718795]
We propose a novel candidate retrieval paradigm based on entity profiling.
We use the profile to query the indexed search engine to retrieve candidate entities.
Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary.
arXiv Detail & Related papers (2022-02-27T17:38:53Z)
- Wikidated 1.0: An Evolving Knowledge Graph Dataset of Wikidata's Revision History [5.727994421498849]
We present Wikidated 1.0, a dataset of Wikidata's full revision history.
To the best of our knowledge, it constitutes the first large dataset of an evolving knowledge graph.
arXiv Detail & Related papers (2021-12-09T15:54:03Z)
- Survey on English Entity Linking on Wikidata [3.8289963781051415]
Wikidata is a frequently updated, community-driven, and multilingual knowledge graph.
Current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes for other knowledge graphs like DBpedia.
Almost all approaches employ specific properties like labels and sometimes descriptions but ignore characteristics such as the hyper-relational structure.
arXiv Detail & Related papers (2021-12-03T16:02:42Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.