LLM Meeting Decision Trees on Tabular Data
- URL: http://arxiv.org/abs/2505.17918v1
- Date: Fri, 23 May 2025 13:57:53 GMT
- Title: LLM Meeting Decision Trees on Tabular Data
- Authors: Hangting Ye, Jinmeng Li, He Zhao, Dandan Guo, Yi Chang,
- Abstract summary: Tabular data have been playing a vital role in diverse real-world fields, including healthcare, finance, etc.<n>With the recent success of Large Language Models (LLMs), early explorations of extending LLMs to the domain of tabular data have been developed.
- Score: 14.527458439318725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data have been playing a vital role in diverse real-world fields, including healthcare, finance, etc. With the recent success of Large Language Models (LLMs), early explorations of extending LLMs to the domain of tabular data have been developed. Most of these LLM-based methods typically first serialize tabular data into natural language descriptions, and then tune LLMs or directly infer on these serialized data. However, these methods suffer from two key inherent issues: (i) data perspective: existing data serialization methods lack universal applicability for structured tabular data, and may pose privacy risks through direct textual exposure, and (ii) model perspective: LLM fine-tuning methods struggle with tabular data, and in-context learning scalability is bottle-necked by input length constraints (suitable for few-shot learning). This work explores a novel direction of integrating LLMs into tabular data throughough logical decision tree rules as intermediaries, proposes a decision tree enhancer with LLM-derived rule for tabular prediction, DeLTa. The proposed DeLTa avoids tabular data serialization, and can be applied to full data learning setting without LLM fine-tuning. Specifically, we leverage the reasoning ability of LLMs to redesign an improved rule given a set of decision tree rules. Furthermore, we provide a calibration method for original decision trees via new generated rule by LLM, which approximates the error correction vector to steer the original decision tree predictions in the direction of ``errors'' reducing. Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance.
Related papers
- Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework.<n>It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z) - Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system.<n>It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z) - LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression.<n>LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model.<n>Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models [15.603556124006479]
We propose retrieval-augmented language models for scalable TabICL.<n>Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction-tuning for LLMs.<n>This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets.
arXiv Detail & Related papers (2025-02-05T13:16:41Z) - Transfer Learning of Tabular Data by Finetuning Large Language Models [0.0]
This paper investigates the effectiveness of an application programming interface (API) and transfer learning of large language models (LLM)<n>LLM APIs respond to input text prompts with tokenized data and instructions, whereas transfer learning finetunes an LLM for a target classification task.<n>This paper proposes an end-to-end finetuning of LLM to demonstrate cross-data transfer learning on ten benchmark data sets.
arXiv Detail & Related papers (2025-01-12T16:23:18Z) - Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models [0.0]
Large Language Models (LLMs) have shown exceptional performance in text processing.
This paper proposes a novel approach to training LLMs using knowledge transfer from a random forest (RF) ensemble.
We generate outputs for fine-tuning, enhancing the model's ability to classify and explain its decisions.
arXiv Detail & Related papers (2024-06-07T13:31:51Z) - LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z) - Large Language Models for Data Annotation and Synthesis: A Survey [49.8318827245266]
This survey focuses on the utility of Large Language Models for data annotation and synthesis.<n>It includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis.
arXiv Detail & Related papers (2024-02-21T00:44:04Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - Incorporating LLM Priors into Tabular Learners [6.835834518970967]
We introduce two strategies utilizing Large Language Models (LLMs) for ranking categorical variables.
We focus on Logistic Regression, introducing MonotonicLR that employs a non-linear monotonic function for mapping ordinals to cardinals.
arXiv Detail & Related papers (2023-11-20T09:27:09Z) - Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning"
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.