Leveraging Schema Labels to Enhance Dataset Search
- URL: http://arxiv.org/abs/2001.10112v1
- Date: Mon, 27 Jan 2020 22:41:02 GMT
- Title: Leveraging Schema Labels to Enhance Dataset Search
- Authors: Zhiyu Chen, Haiyan Jia, Jeff Heflin, Brian D. Davison
- Abstract summary: We propose a novel schema label generation model which generates possible schema labels based on dataset table content.
We incorporate the generated schema labels into a mixed ranking model which considers the relevance between the query and dataset metadata.
Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task.
- Score: 20.63182827636973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A search engine's ability to retrieve desirable datasets is important for
data sharing and reuse. Existing dataset search engines typically rely on
matching queries to dataset descriptions. However, a user may not have enough
prior knowledge to write a query using terms that match with description
text. We propose a novel schema label generation model which generates possible
schema labels based on dataset table content. We incorporate the generated
schema labels into a mixed ranking model which not only considers the relevance
between the query and dataset metadata but also the similarity between the
query and generated schema labels. To evaluate our method on real-world
datasets, we create a new benchmark specifically for the dataset retrieval
task. Experiments show that our approach can effectively improve the precision
and NDCG scores of the dataset retrieval task compared with baseline methods.
We also test on a collection of Wikipedia tables to show that the features
generated from schema labels can improve the unsupervised and supervised web
table retrieval task as well.
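The mixed ranking idea in the abstract, scoring a dataset by both query-to-metadata relevance and query-to-schema-label similarity, can be illustrated with a small sketch. This is not the authors' model: the Jaccard token overlap, the linear interpolation, the weight `alpha`, and the names `mixed_score` and `rank` are all assumptions chosen for illustration, and the schema labels are supplied by hand rather than generated from table content.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization (a deliberately crude stand-in)."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap in [0, 1]; returns 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mixed_score(query: str, metadata: str, schema_labels: list[str],
                alpha: float = 0.5) -> float:
    """Hypothetical mixed score: alpha weights the metadata match,
    (1 - alpha) weights the schema-label match."""
    q = tokenize(query)
    meta_sim = jaccard(q, tokenize(metadata))
    label_sim = jaccard(q, tokenize(" ".join(schema_labels)))
    return alpha * meta_sim + (1 - alpha) * label_sim


def rank(query: str, datasets: list[dict], alpha: float = 0.5) -> list[dict]:
    """Sort datasets by descending mixed score."""
    return sorted(
        datasets,
        key=lambda d: mixed_score(query, d["metadata"], d["labels"], alpha),
        reverse=True,
    )


# Made-up example data: the second dataset's description never mentions
# the query terms, but its (hand-written) schema labels do.
datasets = [
    {"name": "a", "metadata": "city budget report", "labels": ["department", "amount"]},
    {"name": "b", "metadata": "annual records", "labels": ["salary", "employee"]},
]
ranked = rank("employee salary", datasets)
```

In this toy setting, dataset `b` is ranked first only because of its schema labels, which mirrors the motivation in the abstract: a query can miss the description text yet still match labels derived from the table.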
Related papers
- Schema Matching with Large Language Models: an Experimental Study [0.580553237364985]
We investigate the use of off-the-shelf Large Language Models (LLMs) for schema matching.
Our objective is to identify semantic correspondences between elements of two relational schemas using only names and descriptions.
arXiv Detail & Related papers (2024-07-16T15:33:00Z)
- Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu).
DAQu augments the original query with various (query-related) metadata across multiple tables.
We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z)
- Standardness Fogs Meaning: A Position Regarding the Informed Usage of Standard Datasets [0.5497663232622965]
We evaluate the match between use case, derived categories, and labels of standard datasets.
For the 20 Newsgroups dataset, we demonstrate that the labels are imprecise.
We conclude that a concept of standardness of a dataset implies that there is a match between use case, derived categories, and class labels.
arXiv Detail & Related papers (2024-06-19T13:39:05Z)
- QueryNER: Segmentation of E-commerce Queries [12.563241705572409]
We present a manually-annotated dataset and accompanying model for e-commerce query segmentation.
Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types.
arXiv Detail & Related papers (2024-05-15T16:58:35Z)
- ReMatch: Retrieval Enhanced Schema Matching with LLMs [0.874967598360817]
We present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs).
Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher.
arXiv Detail & Related papers (2024-03-03T17:14:40Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks.
We then tailor the labeling model specifically to the task of entity matching.
We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Semantic Labeling Using a Deep Contextualized Language Model [9.719972529205101]
We propose a context-aware semantic labeling method using both the column values and context.
Our new method is based on a new setting for semantic labeling, where we sequentially predict labels for an input table with missing headers.
To our knowledge, we are the first to successfully apply BERT to solve the semantic labeling task.
arXiv Detail & Related papers (2020-10-30T03:04:22Z)
- Object Detection with a Unified Label Space from Multiple Datasets [94.33205773893151]
Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces.
Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset.
Some categories, like face here, would thus be considered foreground in one dataset, but background in another.
We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels.
arXiv Detail & Related papers (2020-08-15T00:51:27Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)