Instruction Tuning of Large Language Models for Tabular Data Generation in One Day
- URL: http://arxiv.org/abs/2511.23220v1
- Date: Fri, 28 Nov 2025 14:26:46 GMT
- Title: Instruction Tuning of Large Language Models for Tabular Data Generation in One Day
- Authors: Milad Abdollahzadeh, Abdul Raheem, Zilong Zhao, Uzair Javaid, Kevin Yee, Nalam Venkata Abhishek, Tram Truong-Huu, Biplab Sikdar
- Abstract summary: Tabular instruction tuning has emerged as a promising research direction for improving LLMs' understanding of tabular data. In this work, we explore the efficacy of instruction tuning in improving tabular data generation capabilities. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.
- Score: 9.944627235801223
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tabular instruction tuning has emerged as a promising research direction for improving LLMs' understanding of tabular data. However, the majority of existing works consider only question-answering and reasoning tasks over tabular data, leaving tabular data generation largely overlooked. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs' tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we investigate whether instruction tuning for tabular data generation is feasible with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.
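For a concrete picture of the recipe described in the abstract (serialize table rows into instruction/response pairs, then supervised fine-tuning of Llama3.1-8B-Instruct on roughly 7K examples), the sketch below shows one way it could be put together with common open-source tooling. It is a minimal sketch, not the authors' code: the prompt template, the source table file `adult.csv`, the use of TRL's `SFTTrainer`, and every hyperparameter are assumptions for illustration only.

```python
# Illustrative sketch only, not the authors' released code. The prompt template,
# source table "adult.csv", and all hyperparameters are assumptions.
import pandas as pd
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

def row_to_example(row: pd.Series, table_name: str, schema: str) -> str:
    """Serialize one table row as an instruction/response training example."""
    instruction = (
        f"Generate one realistic record for the table '{table_name}' "
        f"with columns: {schema}. Return it as 'column: value' lines."
    )
    response = "\n".join(f"{col}: {row[col]}" for col in row.index)
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

# Build ~7K instruction examples from a real table (hypothetical file name).
df = pd.read_csv("adult.csv")
schema = ", ".join(df.columns)
sampled = df.sample(n=7000, replace=True, random_state=0)
train_ds = Dataset.from_dict(
    {"text": [row_to_example(r, "adult", schema) for _, r in sampled.iterrows()]}
)

# Supervised fine-tuning of Llama3.1-8B-Instruct with TRL (assumed settings;
# the paper only reports that tuning took under 6 hours on one A100).
config = SFTConfig(
    output_dir="llama31-8b-tabular-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    dataset_text_field="text",
)
trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    args=config,
    train_dataset=train_ds,
)
trainer.train()
```

At generation time, the tuned model would presumably be prompted with the same instruction format and its 'column: value' output parsed back into table rows; the paper's exact prompting and parsing details are not given in the abstract.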
Related papers
- TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation [50.319535974012]
Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets. Most available datasets for supervised and preference fine-tuning are in English. We propose the Taxonomy-Guided Preference Data Generation (TaP) framework.
arXiv Detail & Related papers (2025-06-30T15:45:28Z) - Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models [22.16558378953053]
We build state-of-the-art instruction-tuning datasets sourced from human-written instructions. LLMs fine-tuned on our datasets consistently outperform those fine-tuned on existing ones. Analyses suggest that instruction-tuning in a new language allows LLMs to follow instructions, while the tuned models exhibit a notable lack of culture-specific knowledge in that language.
arXiv Detail & Related papers (2025-03-31T04:28:38Z) - Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models [15.603556124006479]
We propose retrieval-augmented language models for scalable TabICL. Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction-tuning for LLMs. This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets.
arXiv Detail & Related papers (2025-02-05T13:16:41Z) - Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates [57.29125360837203]
Cookbook is a framework that generates training data consisting of simple patterns over random tokens.
We find that fine-tuning on Cookbook-generated data improves performance on the corresponding task by up to 52.7 accuracy points.
arXiv Detail & Related papers (2024-10-07T17:29:40Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements of up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z) - Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
Training LLaVA1.5 on a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z) - TabuLa: Harnessing Language Models for Tabular Data Synthesis [4.539846270369207]
Tabula is a tabular data synthesizer that leverages the structure of large language models (LLMs). Unlike state-of-the-art (SOTA) LLMs, Tabula discards the pre-trained weights originally designed for natural language tasks. Experiments show that Tabula achieves superior synthetic data utility compared to current SOTA methods.
arXiv Detail & Related papers (2023-10-19T13:50:56Z) - Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation [92.2167864437497]
We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data.
Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions.
By leveraging existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost of generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data whenever a new annotated dataset becomes available (a toy sketch of this metadata-driven construction appears after this list).
arXiv Detail & Related papers (2023-05-23T17:56:26Z) - TABLET: Learning From Instructions For Tabular Data [46.62140500101618]
We introduce TABLET, a benchmark of 20 diverse datasets annotated with instructions that vary in their phrasing, granularity, and technicality.
We find in-context instructions increase zero-shot F1 performance for Flan-T5 11b by 44% on average and 13% for ChatGPT on TABLET.
arXiv Detail & Related papers (2023-04-25T23:07:20Z) - LiDAR dataset distillation within Bayesian active learning framework: Understanding the effect of data augmentation [63.20765930558542]
Active learning (AL) has recently regained attention as a way to reduce annotation costs and dataset size.
This paper performs a principled evaluation of AL-based dataset distillation on one quarter of the large Semantic-KITTI dataset.
We observe that data augmentation achieves full-dataset accuracy using only 60% of the samples from the selected dataset configuration.
arXiv Detail & Related papers (2022-02-06T00:04:21Z)
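As referenced in the Dynosaur entry above, the following toy sketch illustrates the general idea of metadata-driven instruction-data curation: an existing annotated dataset's fields are mapped into an instruction/input/output format. It is a simplified illustration, not the Dynosaur implementation; the LLM call that would propose the instruction and field mapping is replaced by a hard-coded stub, and the in-memory records are invented.

```python
# Toy illustration of Dynosaur-style curation (not the authors' implementation).
# The LLM step that proposes the task from dataset metadata is stubbed out, and
# the "annotated dataset" is an invented in-memory example.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    instruction: str
    input_field: str
    output_field: str

def propose_task(dataset_name: str, field_names: list[str]) -> TaskSpec:
    """Stand-in for the LLM that reads metadata (name, fields) and proposes a task.
    In Dynosaur this mapping is generated by an LLM; here it is hard-coded."""
    return TaskSpec(
        instruction="Summarize the following article in one sentence.",
        input_field="document",
        output_field="summary",
    )

annotated_records = [  # pretend this is an existing annotated dataset
    {"document": "The city council approved the new transit plan on Monday "
                 "after months of public hearings.",
     "summary": "City council approves transit plan."},
]

spec = propose_task("toy_news", list(annotated_records[0].keys()))
instruction_data = [
    {"instruction": spec.instruction,
     "input": rec[spec.input_field],
     "output": rec[spec.output_field]}
    for rec in annotated_records
]
print(instruction_data[0])
```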
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.