Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web
- URL: http://arxiv.org/abs/2408.14636v1
- Date: Mon, 26 Aug 2024 21:00:25 GMT
- Title: Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web
- Authors: Kate Lin, Tarfah Alrashed, Natasha Noy
- Abstract summary: We study dataset relationships from the perspective of users who discover, use, and share datasets on the Web.
We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery.
We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%.
- Score: 1.02801486034657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with schema.org markup. We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Learning Representations without Compositional Assumptions [79.12273403390311]
We propose a data-driven approach that learns feature set dependencies by representing feature sets as graph nodes and their relationships as learnable edges.
We also introduce LEGATO, a novel hierarchical graph autoencoder that learns a smaller, latent graph to aggregate information from multiple views dynamically.
arXiv Detail & Related papers (2023-05-31T10:36:10Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- RTE: A Tool for Annotating Relation Triplets from Text [3.2958527541557525]
In relation extraction, we focus on binary relations, that is, relations between two entities.
The lack of clean annotated datasets is a key challenge in this area of research.
In this work, we built a web-based tool where researchers can annotate for relation extraction on their own datasets.
arXiv Detail & Related papers (2021-08-18T14:54:22Z)
- WebRED: Effective Pretraining And Finetuning For Relation Extraction On The Web [4.702325864333419]
WebRED is a strongly supervised, human-annotated dataset for extracting relationships from text found on the World Wide Web.
We show that combining pre-training on a large weakly supervised dataset with fine-tuning on a small strongly-supervised dataset leads to better relation extraction performance.
arXiv Detail & Related papers (2021-02-18T23:56:12Z)
- Mining Feature Relationships in Data [0.0]
Feature relationship mining (FRM) uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data.
Our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features.
Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships.
arXiv Detail & Related papers (2021-02-02T07:06:16Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework preserves the relations between samples well.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
- Exploration and Discovery of the COVID-19 Literature through Semantic Visualization [9.687961759392559]
We are developing semantic visualization techniques to enhance exploration and enable discovery over large datasets of relations.
Our hope is that this will enable the discovery of novel inferences over relations in complex data that otherwise would go unnoticed.
arXiv Detail & Related papers (2020-07-03T16:40:37Z)
- On Embeddings in Relational Databases [11.52782249184251]
We address the problem of learning a distributed representation of entities in a relational database using a low-dimensional embedding.
Recent methods for learning embeddings take a naive approach: they fully denormalize the database by computing the full join of all tables and representing the result as a knowledge graph.
In this paper, we demonstrate a better methodology for learning representations by exploiting the underlying semantics of columns in a table while using the relation joins and the latent inter-row relationships.
arXiv Detail & Related papers (2020-05-13T17:21:27Z)