Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science
- URL: http://arxiv.org/abs/2403.20208v6
- Date: Sat, 13 Jul 2024 08:25:16 GMT
- Title: Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science
- Authors: Yazheng Yang, Yuqi Wang, Sankalok Sen, Lei Li, Qi Liu,
- Abstract summary: This research endeavors to apply Large Language Models (LLMs) towards addressing these predictive tasks.
Our research aims to mitigate this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2.
- Score: 17.910306140400046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the domain of data science, the predictive tasks of classification, regression, and imputation of missing values are commonly encountered challenges associated with tabular data. This research endeavors to apply Large Language Models (LLMs) towards addressing these predictive tasks. Despite their proficiency in comprehending natural language, LLMs fall short in dealing with structured tabular data. This limitation stems from their lacking exposure to the intricacies of tabular data during their foundational training. Our research aims to mitigate this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2 on this enriched dataset. Furthermore, we investigate the practical application of applying the trained model to zero-shot prediction, few-shot prediction, and in-context learning scenarios. Through extensive experiments, our methodology has shown significant improvements over existing benchmarks. These advancements highlight the efficacy of tailoring LLM training to solve table-related problems in data science, thereby establishing a new benchmark in the utilization of LLMs for enhancing tabular intelligence.
Related papers
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - Enhancing Temporal Understanding in LLMs for Semi-structured Tables [50.59009084277447]
We conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of large language models (LLMs)
Our investigation leads to enhancements in TempTabQA, a dataset specifically designed for temporal temporal question answering.
We introduce a novel approach, C.L.E.A.R. to strengthen LLM capabilities in this domain.
arXiv Detail & Related papers (2024-07-22T20:13:10Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs)
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z) - CLAIM Your Data: Enhancing Imputation Accuracy with Contextual Large Language Models [0.18416014644193068]
This paper introduces the Contextual Language model for Accurate Imputation Method (CLAIM)
Unlike traditional imputation methods, CLAIM utilizes contextually relevant natural language descriptors to fill missing values.
Our evaluations across diverse datasets and missingness patterns reveal CLAIM's superior performance over existing imputation techniques.
arXiv Detail & Related papers (2024-05-28T00:08:29Z) - Elephants Never Forget: Testing Language Models for Memorization of
Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
We introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z) - From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models [18.219485459836285]
Generative Tabular Learning (GTL) is a novel framework that integrates the advanced functionalities of large language models (LLMs)
Our empirical study spans 384 public datasets, rigorously analyzing GTL's scaling behaviors.
GTL-LLaMA-2 model demonstrates superior zero-shot and in-context learning capabilities across numerous classification and regression tasks.
arXiv Detail & Related papers (2023-10-11T09:37:38Z) - Large Language Models as Data Preprocessors [9.99065004972981]
Large Language Models (LLMs) have marked a significant advancement in artificial intelligence.
This study explores their potential in data preprocessing, a critical stage in data mining and analytics applications.
We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques.
arXiv Detail & Related papers (2023-08-30T23:28:43Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word
Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.