Related papers: Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

URL: http://arxiv.org/abs/2402.06282v6
Date: Wed, 05 Feb 2025 21:58:32 GMT
Title: Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
Authors: Riccardo Cappuzzo, Aimee Coelho, Felix Lefebvre, Paolo Papotti, Gael Varoquaux
Abstract summary: We present an in-depth analysis of automated table augmentation for machine learning tasks. We analyze different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake).
Score: 7.449868392714658
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine-learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes outlines 1) the importance of accurately retrieving join candidates, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, autoML, and learning in data lakes.
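
The first two steps named in the abstract (retrieving joinable tables, then merging them into the base table) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: ranking candidates by key containment, the 0.5 threshold, mean aggregation for one-to-many matches, and all function, table, and column names are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def containment(base_keys, candidate_keys):
    """Fraction of distinct base-table keys that also appear in the candidate."""
    base = set(base_keys)
    if not base:
        return 0.0
    return len(base & set(candidate_keys)) / len(base)


def retrieve(base_rows, key, lake, threshold=0.5):
    """Rank lake tables by key containment; keep those at or above the threshold.

    The threshold value is a hypothetical choice, not one taken from the paper.
    """
    base_keys = [row[key] for row in base_rows]
    scored = []
    for name, rows in lake.items():
        score = containment(base_keys, [r[key] for r in rows if key in r])
        if score >= threshold:
            scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]


def left_join_mean(base_rows, candidate_rows, key, feature):
    """Left-join one numeric feature, averaging when a key matches several rows."""
    groups = defaultdict(list)
    for row in candidate_rows:
        groups[row[key]].append(row[feature])
    merged = []
    for row in base_rows:
        values = groups.get(row[key])
        # Unmatched base rows are kept, with the new feature set to None.
        merged.append({**row, feature: mean(values) if values else None})
    return merged


# Toy base table and toy "data lake" (illustrative data only).
base = [{"city": "Paris", "target": 1}, {"city": "Rome", "target": 0}]
lake = {
    "population": [
        {"city": "Paris", "pop": 2.0},
        {"city": "Paris", "pop": 3.0},
        {"city": "Rome", "pop": 2.8},
    ],
    "unrelated": [{"city": "Tokyo", "pop": 14.0}],
}

candidates = retrieve(base, "city", lake)
augmented = left_join_mean(base, lake[candidates[0]], "city", "pop")
```

In this toy setting only the "population" table is retrieved, and the augmented rows keep the base table's row count because the join is a left join with aggregation. The third step, prediction, would pass the augmented rows to a learner of choice; the abstract reports that tree-based learners are resilient to noisy conditions.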

Related papers

Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System [8.096082871461311]
Pneuma is a retrieval-augmented generation (RAG) system designed to efficiently and effectively discover tabular data. For table representation, Pneuma preserves schema and row-level information to ensure comprehensive data understanding. For table retrieval, Pneuma augments LLMs with traditional information retrieval techniques, such as full-text and vector search.
arXiv Detail & Related papers (2025-04-12T13:20:50Z)
TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models [57.005158277893194]
TableLoRA is a module designed to improve LLMs' understanding of table structure during PEFT. It incorporates special tokens for serializing tables with a special token encoder and uses 2D LoRA to encode low-rank information on cell positions.
arXiv Detail & Related papers (2025-03-06T12:50:14Z)
TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabSketchFM, a neural tabular model for data discovery over data lakes. We finetune the pretrained model for identifying unionable, joinable, and subset table pairs. Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts. TACT contains challenging instructions that demand stitching information scattered across one or more texts. We construct this dataset by leveraging an existing dataset of texts and their associated tables. We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
Squeezing Lemons with Hammers: An Evaluation of AutoML and Tabular Deep Learning for Data-Scarce Classification Applications [2.663744975320783]
We find that L2-regularized logistic regression performs similarly to state-of-the-art automated machine learning (AutoML) frameworks. We recommend considering logistic regression as the first choice for data-scarce applications.
arXiv Detail & Related papers (2024-05-13T11:43:38Z)
An Automatic Prompt Generation System for Tabular Data Tasks [3.117741687220381]
Large language models (LLMs) have demonstrated their ability on several tasks through carefully crafted prompts. This paper presents an innovative auto-prompt generation system suitable for multiple LLMs, with minimal training.
arXiv Detail & Related papers (2024-05-09T08:32:55Z)
TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios [51.66718740300016]
TableLLM is a robust large language model (LLM) with 8 billion parameters. TableLLM is purpose-built for proficiently handling data manipulation tasks. We have released the model checkpoint, source code, benchmarks, and a web application for user interaction.
arXiv Detail & Related papers (2024-03-28T11:21:12Z)
Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
Semantic Data Management in Data Lakes [0.0]
In recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics. One way to prevent data lakes from turning into inoperable data swamps is semantic data management. We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontology-based data access.
arXiv Detail & Related papers (2023-10-23T21:16:50Z)
LakeBench: Benchmarks for Data Discovery over Data Lakes [21.32260396393041]
We develop benchmarks for finding related tables in data repositories. We use tables drawn from a diverse set of data sources such as government data from CKAN, Socrata, and the European Central Bank. None of the existing models had been trained on the data discovery tasks that we developed for this benchmark.
arXiv Detail & Related papers (2023-07-09T16:16:11Z)
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description. To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set. This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
METAM: Goal-Oriented Data Discovery [9.73435089036831]
METAM is a goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks.
arXiv Detail & Related papers (2023-04-18T15:42:25Z)
Deep Lake: a Lakehouse for Deep Learning [0.0]
This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop.
arXiv Detail & Related papers (2022-09-22T05:04:09Z)
LiDAR dataset distillation within bayesian active learning framework: Understanding the effect of data augmentation [63.20765930558542]
Active learning (AL) has recently regained attention as a way to reduce annotation costs and dataset size. This paper performs a principled evaluation of AL-based dataset distillation on a quarter (1/4th) of the large Semantic-KITTI dataset. We observe that data augmentation achieves full dataset accuracy using only 60% of samples from the selected dataset configuration.
arXiv Detail & Related papers (2022-02-06T00:04:21Z)
Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields. Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.