Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
- URL: http://arxiv.org/abs/2509.21465v1
- Date: Thu, 25 Sep 2025 19:30:39 GMT
- Title: Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
- Authors: George Yakushev, Alina Shutova, Ivan Rubachev, Renat Sergazinov, Artem Babenko
- Abstract summary: Tabular foundation models are increasingly popular for low-resource problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees.
- Score: 21.280488775409513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly to run at inference time. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing and manipulating decision trees. By using these tools, LLMs combine their prior knowledge with learning from data to create a lightweight decision tree that outperforms traditional CART on low-resource tabular problems. While a single decision tree does not outperform state-of-the-art black-box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based tree creation process allows for additional human input: correcting biases or incorporating domain-specific intuition that is not captured in the data.
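To make the idea of "tools for constructing, analyzing and manipulating decision trees" concrete, here is a minimal Python sketch of what such a tool set could look like. It is not the paper's actual interface: the names `Node`, `split_leaf`, `predict` and `describe`, the toy `age` feature, and the choice of weighted Gini impurity as the feedback signal are all assumptions made for illustration. In an agentic setup, an LLM would decide which feature and threshold to pass to `split_leaf`; here the call is hard-coded.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    prediction: int                    # majority label of the rows reaching this node
    feature: Optional[str] = None      # split feature; None means the node is a leaf
    threshold: Optional[float] = None
    left: Optional["Node"] = None      # rows with row[feature] <= threshold
    right: Optional["Node"] = None     # rows with row[feature] > threshold


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def majority(labels: List[int], default: int) -> int:
    """Most frequent label, or `default` when the list is empty."""
    return max(set(labels), key=labels.count) if labels else default


def split_leaf(node: Node, rows: List[Dict[str, float]], labels: List[int],
               feature: str, threshold: float) -> float:
    """Hypothetical 'construct' tool: turn a leaf into a split and
    return the weighted Gini impurity of the two children as feedback."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    node.feature, node.threshold = feature, threshold
    node.left = Node(majority(left, node.prediction))
    node.right = Node(majority(right, node.prediction))
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n if n else 0.0


def predict(node: Node, row: Dict[str, float]) -> int:
    """Walk from the root to a leaf and return that leaf's label."""
    while node.feature is not None:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.prediction


def describe(node: Node, depth: int = 0) -> str:
    """Hypothetical 'analyze' tool: render the tree as readable if/else text."""
    pad = "  " * depth
    if node.feature is None:
        return f"{pad}predict {node.prediction}\n"
    return (f"{pad}if {node.feature} <= {node.threshold}:\n"
            + describe(node.left, depth + 1)
            + f"{pad}else:\n"
            + describe(node.right, depth + 1))


# Toy data (made up for this sketch): one numeric feature, binary labels.
rows = [{"age": 25}, {"age": 30}, {"age": 50}, {"age": 60}, {"age": 70}]
labels = [0, 0, 1, 1, 1]

root = Node(majority(labels, default=0))
impurity = split_leaf(root, rows, labels, feature="age", threshold=40.0)
print(describe(root))
```

The design choice worth noting is that every tool returns something an LLM can read back: `split_leaf` reports an impurity score and `describe` produces plain text, so each step leaves an inspectable trace of the kind the abstract describes.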
Related papers
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL).
We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework.
It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z)
- TabReason: A Reinforcement Learning-Enhanced Reasoning LLM for Explainable Tabular Data Prediction [19.350413252699042]
Large language models (LLMs) have demonstrated powerful capabilities to generate human-like reasoning and explanations.
We propose a new approach that leverages reasoning-based LLMs, trained using reinforcement learning, to perform more accurate and explainable predictions.
Our method introduces custom reward functions that guide the model not only toward better prediction accuracy but also toward human-understandable reasons for its predictions.
arXiv Detail & Related papers (2025-05-27T22:23:11Z)
- LLM Meeting Decision Trees on Tabular Data [14.527458439318725]
Tabular data have been playing a vital role in diverse real-world fields, including healthcare, finance, etc.
With the recent success of Large Language Models (LLMs), early explorations of extending LLMs to the domain of tabular data have been developed.
arXiv Detail & Related papers (2025-05-23T13:57:53Z)
- Learning Decision Trees as Amortized Structure Inference [59.65621207449269]
We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data.
We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks.
arXiv Detail & Related papers (2025-03-10T07:05:07Z)
- Zero-Shot Decision Tree Construction via Large Language Models [2.005837558796176]
We introduce an algorithm for constructing decision trees using large language models (LLMs) in a zero-shot manner based on Classification and Regression Trees (CART) principles.
Our approach leverages LLMs to perform operations essential for decision tree construction, including attribute discretization, probability calculation, and Gini index computation.
arXiv Detail & Related papers (2025-01-27T17:48:48Z)
- "Oh LLM, I'm Asking Thee, Please Give Me a Decision Tree": Zero-Shot Decision Tree Induction and Embedding with Large Language Models [1.742301293487176]
Large language models (LLMs) provide powerful means to leverage prior knowledge for predictive modeling when data is limited.
In this work, we demonstrate how LLMs can use their compressed world knowledge to generate intrinsically interpretable machine learning models.
arXiv Detail & Related papers (2024-09-27T09:53:48Z)
- Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning [53.241569810013836]
We propose a novel framework that utilizes large language models (LLMs) to identify effective feature generation rules.
We use decision trees to convey this reasoning information, as they can be easily represented in natural language.
OCTree consistently enhances the performance of various prediction models across diverse benchmarks.
arXiv Detail & Related papers (2024-06-12T08:31:34Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.