LakeBench: Benchmarks for Data Discovery over Data Lakes
- URL: http://arxiv.org/abs/2307.04217v1
- Date: Sun, 9 Jul 2023 16:16:11 GMT
- Title: LakeBench: Benchmarks for Data Discovery over Data Lakes
- Authors: Kavitha Srinivas, Julian Dolby, Ibrahim Abdelaziz, Oktie Hassanzadeh,
Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Subhajit Chaudhury, Horst
Samulowitz
- Abstract summary: We develop benchmarks for finding related tables in data repositories.
We use tables drawn from a diverse set of data sources such as government data from CKAN, Socrata, and the European Central Bank.
None of the existing models had been trained on the data discovery tasks that we developed for this benchmark.
- Score: 21.32260396393041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Within enterprises, there is a growing need to intelligently navigate data
lakes, specifically focusing on data discovery. Of particular importance to
enterprises is the ability to find related tables in data repositories. These
tables can be unionable, joinable, or subsets of each other. There is a dearth
of benchmarks for these tasks in the public domain, with related work targeting
private datasets. In LakeBench, we develop multiple benchmarks for these tasks
by using the tables that are drawn from a diverse set of data sources such as
government data from CKAN, Socrata, and the European Central Bank. We compare
the performance of 4 publicly available tabular foundational models on these
tasks. None of the existing models had been trained on the data discovery tasks
that we developed for this benchmark; not surprisingly, their performance shows
significant room for improvement. The results suggest that the establishment of
such benchmarks may be useful to the community to build tabular models usable
for data discovery in data lakes.
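The three table relationships named in the abstract (unionable, joinable, subset) can be illustrated with a minimal sketch. This is only a set-based approximation under stated assumptions, not the procedure LakeBench uses to build or label its benchmarks. The `Table` representation, the function names, and the 0.5 overlap threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical table representation: column name -> list of cell values.
Table = Dict[str, List[str]]


def containment(a: List[str], b: List[str]) -> float:
    """Fraction of the distinct values in `a` that also appear in `b`."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0


def joinable_columns(t1: Table, t2: Table, threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Column pairs whose value overlap suggests a possible join key.

    The threshold is an arbitrary illustrative value.
    """
    return [
        (c1, c2)
        for c1, v1 in t1.items()
        for c2, v2 in t2.items()
        if containment(v1, v2) >= threshold
    ]


def is_unionable(t1: Table, t2: Table) -> bool:
    """Schema-only heuristic: identical column names, so rows could be stacked."""
    return set(t1) == set(t2)


def is_subset(t1: Table, t2: Table) -> bool:
    """True if the tables share a schema and every row of t1 occurs in t2."""
    if set(t1) != set(t2):
        return False
    cols = sorted(t1)
    rows1 = set(zip(*(t1[c] for c in cols)))
    rows2 = set(zip(*(t2[c] for c in cols)))
    return rows1 <= rows2


# Toy tables, invented for this example.
cities = {"city": ["Paris", "Rome", "Oslo"], "country": ["FR", "IT", "NO"]}
more_cities = {
    "city": ["Paris", "Rome", "Oslo", "Bern"],
    "country": ["FR", "IT", "NO", "CH"],
}
population = {"name": ["Paris", "Rome", "Lima"], "pop_m": ["2.1", "2.8", "9.7"]}

print(is_unionable(cities, more_cities))     # same schema
print(is_subset(cities, more_cities))        # all rows of cities are present
print(joinable_columns(cities, population))  # candidate join keys
```

Exact string matching of this kind misses semantically related columns (for example, differently named or differently formatted values), which is one reason learned tabular models are evaluated on these tasks.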
Related papers
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabSketchFM, a neural tabular model for data discovery over data lakes.
We finetune the pretrained model for identifying unionable, joinable, and subset table pairs.
Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z)
- 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [67.47600679176963]
RDBs store vast amounts of rich, informative data spread across interconnected tables.
Progress in predictive machine learning models for RDBs lags behind advances in other domains such as computer vision or natural language processing.
We explore a class of baseline models predicated on converting multi-table datasets into graphs.
We assemble (i) a diverse collection of large-scale RDB datasets and (ii) coincident predictive tasks.
arXiv Detail & Related papers (2024-04-28T15:04:54Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios [52.73289223176475]
TableLLM is a robust large language model (LLM) with 13 billion parameters.
TableLLM is purpose-built for proficiently handling data manipulation tasks.
We have released the model checkpoint, source code, benchmarks, and a web application for user interaction.
arXiv Detail & Related papers (2024-03-28T11:21:12Z)
- Retrieve, Merge, Predict: Augmenting Tables with Data Lakes [7.449868392714658]
We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table.
As data lakes, the paper uses YADL (Yet Another Data Lake) and Open Data US, a well-referenced real data lake.
arXiv Detail & Related papers (2024-02-09T09:48:38Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Characterizing Transactional Databases for Frequent Itemset Mining [0.0]
This paper presents a study of the characteristics of transactional databases used in frequent itemset mining.
Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones.
Based on our characterization, we provide a set of representative datasets that can safely be used as a benchmark.
arXiv Detail & Related papers (2020-11-09T12:26:14Z)