Graph integration of structured, semistructured and unstructured data
for data journalism
- URL: http://arxiv.org/abs/2007.12488v2
- Date: Fri, 30 Oct 2020 08:07:09 GMT
- Title: Graph integration of structured, semistructured and unstructured data
for data journalism
- Authors: Oana Balalau (CEDAR), Catarina Conceição (INESC-ID, IST),
Helena Galhardas (INESC-ID, IST), Ioana Manolescu (CEDAR), Tayeb Merabti
(CEDAR), Jingmao You (CEDAR, IP Paris), Youssr Youssef (CEDAR, ENSAE, IP
Paris)
- Abstract summary: We describe a complete approach for integrating dynamic sets of heterogeneous data sources.
Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, journalism is facilitated by the existence of large amounts of
digital data sources, including many Open Data ones. Such data sources are
extremely heterogeneous, ranging from highly structured (relational
databases) to semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text.
Journalists (and other classes of users lacking advanced IT expertise, such as
most non-governmental organizations, or small public administrations) need to
be able to make sense of such heterogeneous corpora, even if they lack the
ability to define and deploy custom extract-transform-load workflows. These are
difficult to set up not only for arbitrary heterogeneous inputs, but also given
that users may want to add (or remove) datasets to (from) the corpus. We
describe a complete approach for integrating dynamic sets of heterogeneous data
sources along the lines described above: the challenges we faced to make such
graphs useful, allow their integration to scale, and the solutions we proposed
for these problems. Our approach is implemented within the ConnectionLens
system; we validate it through a set of experiments.
Related papers
- Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs [0.061446808540639365]
This work explores the usage of Knowledge Graphs (KG) as a basic framework for capturing complex analytics in a human-centered manner.
The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems.
arXiv Detail & Related papers (2024-11-01T20:45:23Z)
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Federated Neural Graph Databases [53.03085605769093]
We propose Federated Neural Graph Database (FedNGDB), a novel framework that enables reasoning over multi-source graph-based data while preserving privacy.
Unlike existing methods, FedNGDB can handle complex graph structures and relationships, making it suitable for various downstream tasks.
arXiv Detail & Related papers (2024-02-22T14:57:44Z)
- Cross Modal Data Discovery over Structured and Unstructured Data Lakes [5.270224494298927]
Organizations are collecting increasingly large amounts of data for data driven decision making.
These data are often dumped into a centralized repository, consisting of thousands of structured and unstructured datasets.
Perversely, such a mixture of datasets makes the problem of discovering elements relevant to a user's query or an analytical task very challenging.
arXiv Detail & Related papers (2023-06-01T17:34:42Z)
- Deep Transfer Learning for Multi-source Entity Linkage via Domain
Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Graph integration of structured, semistructured and unstructured data
for data journalism [4.508924138721326]
We describe a complete approach for integrating dynamic sets of heterogeneous datasets.
Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
arXiv Detail & Related papers (2020-12-16T09:59:27Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)
- Siamese Graph Neural Networks for Data Integration [11.41207739004894]
We propose a general approach to modeling and integrating entities from structured data, such as relational databases, as well as unstructured sources, such as free text from news articles.
Our approach is designed to explicitly model and leverage relations between entities, thereby using all available information and preserving as much context as possible.
We evaluate our method on the task of integrating data about business entities, and we demonstrate that it outperforms standard rule-based systems, as well as other deep learning approaches that do not use graph-based representations.
arXiv Detail & Related papers (2020-01-17T21:51:55Z)