Generative Benchmark Creation for Table Union Search
- URL: http://arxiv.org/abs/2308.03883v1
- Date: Mon, 7 Aug 2023 19:26:09 GMT
- Title: Generative Benchmark Creation for Table Union Search
- Authors: Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller
- Abstract summary: We present a novel method for using generative models to create tables with specified properties.
We show that the new benchmark is more challenging for all methods than hand-curated benchmarks.
- Score: 4.970364068620607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data management has traditionally relied on synthetic data generators to
generate structured benchmarks, like the TPC suite, where we can control
important parameters like data size and its distribution precisely. These
benchmarks were central to the success and adoption of database management
systems. But more and more, data management problems are of a semantic nature.
An important example is finding tables that can be unioned. While any two
tables with the same arity (number of columns) can be unioned, table union
search is the problem of finding tables whose union is semantically coherent. Semantic
problems cannot be benchmarked using synthetic data. Our current methods for
creating benchmarks involve the manual curation and labeling of real data.
These methods are not robust or scalable and perhaps more importantly, it is
not clear how robust the created benchmarks are. We propose to use generative
AI models to create structured data benchmarks for table union search. We
present a novel method for using generative models to create tables with
specified properties. Using this method, we create a new benchmark containing
pairs of tables that are both unionable and non-unionable but related. We
thoroughly evaluate recent table union search methods over existing
benchmarks and our new benchmark. We also present and evaluate a new table
union search method based on recent large language models over all benchmarks. We
show that the new benchmark is more challenging for all methods than
hand-curated benchmarks: specifically, the top-performing method achieves a
Mean Average Precision of around 60%, over 30% less than its performance on
existing manually created benchmarks. We examine why this is the case and show
that the new benchmark permits more detailed analysis of methods, including a
study of both false positives and false negatives that was not possible with
existing benchmarks.
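The abstract scores methods by Mean Average Precision (MAP): the average precision of each query table's ranked candidate list, averaged over all query tables. As a reference point, here is a minimal sketch of that computation in Python; the table identifiers and ground-truth labels are hypothetical, and the benchmark and search methods themselves are not shown.

```python
from typing import Dict, List, Set

def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked candidate list against ground truth."""
    hits, total = 0, 0.0
    for rank, table_id in enumerate(ranked, start=1):
        if table_id in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results: Dict[str, List[str]],
                           truth: Dict[str, Set[str]]) -> float:
    """MAP over all query tables in a table union search benchmark."""
    scores = [average_precision(ranked, truth.get(query, set()))
              for query, ranked in results.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy run: one query table, three ranked candidates,
# two of which are truly unionable with the query.
results = {"query_table": ["cand_a", "cand_b", "cand_c"]}
truth = {"query_table": {"cand_a", "cand_c"}}
print(mean_average_precision(results, truth))  # (1/1 + 2/3) / 2 ≈ 0.833
```

On this toy input the top-ranked candidate is relevant but the second relevant table only appears at rank 3, so MAP falls below 1.0; ranking truly unionable tables lower directly erodes the score.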
Related papers
- Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks [0.9374652839580181]
Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes.
These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks.
We propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.
arXiv Detail & Related papers (2025-05-27T15:23:52Z)
- Is This a Bad Table? A Closer Look at the Evaluation of Table Generation from Text [21.699434525769586]
Existing measures for table quality evaluation fail to capture the overall semantics of the tables.
We propose TabEval, a novel table evaluation strategy that captures table semantics.
To validate our approach, we curate a dataset comprising text descriptions for 1,250 diverse Wikipedia tables.
arXiv Detail & Related papers (2024-06-21T02:18:03Z)
- TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios [52.73289223176475]
TableLLM is a robust large language model (LLM) with 13 billion parameters.
TableLLM is purpose-built for proficiently handling data manipulation tasks.
We have released the model checkpoint, source code, benchmarks, and a web application for user interaction.
arXiv Detail & Related papers (2024-03-28T11:21:12Z)
- Investigating Data Contamination in Modern Benchmarks for Large Language Models [27.479260572913724]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs.
We study data contamination by proposing two methods tailored for both open-source and proprietary LLMs.
We find that certain commercial LLMs could surprisingly guess the missing option in various test sets.
arXiv Detail & Related papers (2023-11-16T11:03:04Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on the Poincaré distance metric (sketched after this entry).
Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences.
Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z)
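The Proton entry above probes pre-trained language models with a Poincaré distance metric. As a rough illustration, below is the standard distance formula on the Poincaré ball; the toy embeddings are hypothetical, and Proton's actual probing procedure is not reproduced here.

```python
import math
from typing import Sequence

def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Distance between two points strictly inside the unit Poincare ball."""
    def sq_norm(x: Sequence[float]) -> float:
        return sum(xi * xi for xi in x)

    diff = sq_norm([ui - vi for ui, vi in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Hypothetical toy embeddings: the same Euclidean gap costs far more
# hyperbolic distance near the boundary of the ball than near the origin.
print(poincare_distance([0.1, 0.0], [0.0, 0.1]))  # ~0.29
print(poincare_distance([0.9, 0.0], [0.0, 0.9]))  # ~5.20
```

This boundary-stretching behavior is what makes hyperbolic distances attractive for representing hierarchical (schema-like) structure.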
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Small but Mighty: New Benchmarks for Split and Rephrase [18.959219419951083]
Split and Rephrase is a text simplification task of rewriting a complex sentence into simpler ones.
We find that the widely used benchmark dataset universally contains easily exploitable syntactic cues.
We show that even a simple rule-based model can perform on par with the state-of-the-art model.
arXiv Detail & Related papers (2020-09-17T23:37:33Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
- Leveraging Schema Labels to Enhance Dataset Search [20.63182827636973]
We propose a novel schema label generation model which generates possible schema labels based on dataset table content.
We incorporate the generated schema labels into a mixed ranking model which considers the relevance between the query and dataset metadata.
Experiments show that our approach effectively improves the precision and NDCG scores of the dataset retrieval task (NDCG is sketched below).
arXiv Detail & Related papers (2020-01-27T22:41:02Z)
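The last entry reports improvements in precision and NDCG for dataset retrieval. For reference, here is a minimal sketch of the standard NDCG@k computation; the relevance grades are hypothetical and unrelated to the paper's data.

```python
import math
from typing import List

def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(ranked_rels: List[float], k: int = 10) -> float:
    """NDCG@k: DCG of the system ranking normalized by the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance grades for the datasets returned for one query;
# the score reaches 1.0 only when the ranking is already ideal.
print(ndcg([3, 0, 2, 1], k=4))  # ~0.93: a relevant item is ranked too low
```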