Related papers: Leveraging Schema Labels to Enhance Dataset Search

URL: http://arxiv.org/abs/2001.10112v1
Date: Mon, 27 Jan 2020 22:41:02 GMT
Title: Leveraging Schema Labels to Enhance Dataset Search
Authors: Zhiyu Chen, Haiyan Jia, Jeff Heflin, Brian D. Davison
Abstract summary: We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which considers the relevance between the query and dataset metadata. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task.
Score: 20.63182827636973
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match the description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which considers not only the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval task as well.

Related papers

Schema Inference for Tabular Data Repositories Using Large Language Models [12.626848016550051]
We present SI-LLM, which infers a concise conceptual schema for data using only column headers and cell values. SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step.
arXiv Detail & Related papers (2025-09-04T19:50:16Z)
Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents [7.616682226138909]
We introduce the task of intent-based chart generation from documents. The goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an attribution-based metric that uses a structured textual representation of charts.
arXiv Detail & Related papers (2025-07-20T04:34:59Z)
UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification [50.59009084277447]
We introduce UNJOIN, a framework that decouples the retrieval of schema elements from logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. In the second stage, the query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic.
arXiv Detail & Related papers (2025-05-23T17:28:43Z)
TARGET: Benchmarking Table Retrieval for Generative Tasks [7.379012456053551]
TARGET is a benchmark for evaluating TAble Retrieval for GEnerative Tasks. We analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective here than it is for retrieval over unstructured text.
arXiv Detail & Related papers (2025-05-14T19:39:46Z)
Schema Matching with Large Language Models: an Experimental Study [0.580553237364985]
We investigate the use of off-the-shelf Large Language Models (LLMs) for schema matching. Our objective is to identify semantic correspondences between elements of two relational schemas using only names and descriptions.
arXiv Detail & Related papers (2024-07-16T15:33:00Z)
Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu). DAQu augments the original query with various (query-related) metadata across multiple tables. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z)
Standardness Fogs Meaning: A Position Regarding the Informed Usage of Standard Datasets [0.5497663232622965]
We evaluate the match between use case, derived categories, and labels of standard datasets. For the 20 Newsgroups dataset, we demonstrate that the labels are imprecise. We conclude that a concept of standardness of a dataset implies that there is a match between use case, derived categories, and class labels.
arXiv Detail & Related papers (2024-06-19T13:39:05Z)
QueryNER: Segmentation of E-commerce Queries [12.563241705572409]
We present a manually-annotated dataset and accompanying model for e-commerce query segmentation. Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types.
arXiv Detail & Related papers (2024-05-15T16:58:35Z)
Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables [18.330753799139845]
A new dataset, Wiki-TabNER, is proposed to enrich the existing benchmark datasets. This paper describes the distinguishing features of the Wiki-TabNER dataset and the labeling process. In addition, we propose a prompting framework for evaluating the new large language models on the within tables NER task.
arXiv Detail & Related papers (2024-03-07T15:22:07Z)
ReMatch: Retrieval Enhanced Schema Matching with LLMs [0.874967598360817]
We present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs). Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher.
arXiv Detail & Related papers (2024-03-03T17:14:40Z)
Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings. We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks. We then tailor the labeling model specifically to the task of entity matching. We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z)
Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets. We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy. Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
Semantic Labeling Using a Deep Contextualized Language Model [9.719972529205101]
We propose a context-aware semantic labeling method using both the column values and context. Our new method is based on a new setting for semantic labeling, where we sequentially predict labels for an input table with missing headers. To our knowledge, we are the first to successfully apply BERT to solve the semantic labeling task.
arXiv Detail & Related papers (2020-10-30T03:04:22Z)
Object Detection with a Unified Label Space from Multiple Datasets [94.33205773893151]
Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset. Some categories, like face here, would thus be considered foreground in one dataset, but background in another. We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels.
arXiv Detail & Related papers (2020-08-15T00:51:27Z)
ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples. We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.