Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
- URL: http://arxiv.org/abs/2402.06282v4
- Date: Mon, 27 May 2024 13:21:05 GMT
- Title: Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
- Authors: Riccardo Cappuzzo, Aimee Coelho, Felix Lefebvre, Paolo Papotti, Gael Varoquaux,
- Abstract summary: We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table.
As data lakes, the paper uses YADL (Yet Another Data Lake) and Open Data US, a well-referenced real data lake.
- Score: 7.449868392714658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an in-depth analysis of data discovery in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset we developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration on both lakes, our study outlines the importance of accurately retrieving join candidates and the efficiency of simple merging methods. We report new insights on the benefits of existing solutions and on their limitations, aiming at guiding future research in this space.
Related papers
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding.
TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs.
Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabFM, a neural tabular model for data discovery over data lakes.
We finetune the pretrained model for identifying unionable, joinable, and subset table pairs.
Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Relational Deep Learning: Graph Representation Learning on Relational
Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z) - Semantic Data Management in Data Lakes [0.0]
In recent years, data lakes emerged as away to manage large amounts of heterogeneous data for modern data analytics.
One way to prevent data lakes from turning into inoperable data swamps is semantic data management.
We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontologybased data access.
arXiv Detail & Related papers (2023-10-23T21:16:50Z) - LakeBench: Benchmarks for Data Discovery over Data Lakes [21.32260396393041]
We develop benchmarks for finding related tables in data repositories.
We use tables drawn from a diverse set of data sources such as government data from CKAN, Socrata, and the European Central Bank.
None of the existing models had been trained on the data discovery tasks that we developed for this benchmark.
arXiv Detail & Related papers (2023-07-09T16:16:11Z) - DataFinder: Scientific Dataset Recommendation from Natural Language
Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z) - METAM: Goal-Oriented Data Discovery [9.73435089036831]
METAM is a goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process.
We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks.
arXiv Detail & Related papers (2023-04-18T15:42:25Z) - Deep Lake: a Lakehouse for Deep Learning [0.0]
Deep Lake is an open-source lakehouse for deep learning applications developed at Activeloop.
This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop.
arXiv Detail & Related papers (2022-09-22T05:04:09Z) - LiDAR dataset distillation within bayesian active learning framework:
Understanding the effect of data augmentation [63.20765930558542]
Active learning (AL) has re-gained attention recently to address reduction of annotation costs and dataset size.
This paper performs a principled evaluation of AL based dataset distillation on (1/4th) of the large Semantic-KITTI dataset.
We observe that data augmentation achieves full dataset accuracy using only 60% of samples from the selected dataset configuration.
arXiv Detail & Related papers (2022-02-06T00:04:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.