Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection
- URL: http://arxiv.org/abs/2512.07246v1
- Date: Mon, 08 Dec 2025 07:40:48 GMT
- Title: Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection
- Authors: Mengqi Wang, Jianwei Wang, Qing Liu, Xiwei Xu, Zhenchang Xing, Liming Zhu, Wenjie Zhang
- Abstract summary: Error detection is important for ensuring data quality. Recent state-of-the-art ED methods leverage the pre-trained knowledge and semantic capability embedded in large language models (LLMs) to directly label whether a cell is erroneous. We propose an LLM-as-an-inducer framework that adopts an LLM to induce the decision tree for ED (termed TreeED) and further ensembles multiple such trees for consensus detection (termed ForestED). Our methods are accurate, explainable and robust, achieving an average F1-score improvement of 16.1% over the best baseline.
- Score: 24.742137117129502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Error detection (ED), which aims to identify incorrect or inconsistent cell values in tabular data, is important for ensuring data quality. Recent state-of-the-art ED methods leverage the pre-trained knowledge and semantic capability embedded in large language models (LLMs) to directly label whether a cell is erroneous. However, this LLM-as-a-labeler pipeline (1) relies on a black-box, implicit decision process, thus failing to provide explainability for the detection results, and (2) is highly sensitive to prompts, yielding inconsistent outputs due to inherent model stochasticity, therefore lacking robustness. To address these limitations, we propose an LLM-as-an-inducer framework that adopts an LLM to induce the decision tree for ED (termed TreeED) and further ensembles multiple such trees for consensus detection (termed ForestED), thereby improving explainability and robustness. Specifically, based on prompts derived from data context, decision tree specifications and output requirements, TreeED queries the LLM to induce the decision tree skeleton, whose root-to-leaf decision paths specify the stepwise procedure for evaluating a given sample. Each tree contains three types of nodes: (1) rule nodes that perform simple validation checks (e.g., format or range), (2) Graph Neural Network (GNN) nodes that capture complex patterns (e.g., functional dependencies), and (3) leaf nodes that output the final decision types (error or clean). Furthermore, ForestED employs uncertainty-based sampling to obtain multiple row subsets, constructing a decision tree for each subset using TreeED. It then leverages an Expectation-Maximization-based algorithm that jointly estimates tree reliability and optimizes the consensus ED prediction. Extensive experiments demonstrate that our methods are accurate, explainable and robust, achieving an average F1-score improvement of 16.1% over the best baseline.
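The ForestED consensus step described above (jointly estimating tree reliability and the consensus prediction via EM) resembles a Dawid-Skene-style aggregation over binary tree votes. The sketch below is illustrative only, not the paper's implementation: the function name `em_consensus`, the symmetric per-tree reliability parameter, and the uniform initialization are all assumptions.

```python
import numpy as np

def em_consensus(votes, n_iter=50):
    """Dawid-Skene-style EM over binary votes.

    votes: (n_trees, n_cells) 0/1 matrix, 1 = tree flags the cell as erroneous.
    Returns (posterior error probability per cell, estimated reliability per tree).
    """
    n_trees, n_cells = votes.shape
    q = votes.mean(axis=0)           # initial consensus: fraction of trees flagging each cell
    reliab = np.full(n_trees, 0.8)   # initial per-tree accuracy (assumed symmetric)
    prior = q.mean()                 # prior probability that a cell is erroneous
    eps = 1e-9                       # numerical guard for log(0)
    for _ in range(n_iter):
        # E-step: posterior log-odds that each cell is truly erroneous,
        # treating trees as conditionally independent given the true label.
        log_err = np.log(prior + eps) + (
            votes * np.log(reliab[:, None] + eps)
            + (1 - votes) * np.log(1 - reliab[:, None] + eps)
        ).sum(axis=0)
        log_clean = np.log(1 - prior + eps) + (
            (1 - votes) * np.log(reliab[:, None] + eps)
            + votes * np.log(1 - reliab[:, None] + eps)
        ).sum(axis=0)
        m = np.maximum(log_err, log_clean)  # stabilize before exponentiating
        q = np.exp(log_err - m) / (np.exp(log_err - m) + np.exp(log_clean - m))
        # M-step: reliability = expected agreement of each tree with the consensus.
        reliab = (votes * q[None, :] + (1 - votes) * (1 - q[None, :])).mean(axis=1)
        prior = q.mean()
    return q, reliab
```

Under this toy model, trees that consistently agree with the emerging consensus are up-weighted, so a few unreliable trees are outvoted even when they flag the same cells; the actual paper additionally builds each tree on an uncertainty-sampled row subset.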
Related papers
- Step-by-Step Causality: Transparent Causal Discovery with Multi-Agent Tree-Query and Adversarial Confidence Estimation [10.652998143672658]
Tree-Query is a tree-structured, multi-expert LLM framework that reduces pairwise causal discovery to a short sequence of queries. Theoretical guarantees are provided for identifiability of four pairwise relations.
arXiv Detail & Related papers (2026-01-15T07:28:59Z) - Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration [52.52685988964061]
Entropy-Tree is a tree-based decoding method that exploits entropy as a signal for branching decisions. It unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure.
arXiv Detail & Related papers (2026-01-02T07:14:05Z) - Decision Tree Embedding by Leaf-Means [11.318593165494724]
Decision Tree Embedding (DTE) is a fast and effective method that leverages the leaf partitions of a trained classification tree to construct an interpretable feature representation. By using the sample means within each leaf region as anchor points, DTE maps inputs into an embedding space defined by the tree's partition structure. We establish several population-level theoretical properties of DTE, including its preservation of conditional density under mild conditions.
arXiv Detail & Related papers (2025-12-01T15:57:33Z) - Node-Level Uncertainty Estimation in LLM-Generated SQL [13.436696325103147]
We introduce a semantically aware labeling algorithm that assigns node-level correctness without over-penalizing structural containers or alias variation. We represent each node with a rich set of schema-aware and lexical features, capturing identifier validity, alias resolution, type compatibility, ambiguity in scope, and typo signals. We interpret these probabilities as uncertainty, enabling fine-grained diagnostics that pinpoint exactly where a query is likely to be wrong.
arXiv Detail & Related papers (2025-11-17T23:31:45Z) - Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models [13.433506313486701]
Tree search has emerged as a powerful framework for aligning generative models with task-specific rewards at test time. We propose TReASURe, a tree-search test-time alignment method that addresses these issues. TReASURe achieves state-of-the-art results on perplexity, linguistic acceptability, and control of sentiment and toxicity.
arXiv Detail & Related papers (2025-09-27T06:22:45Z) - ZTree: A Subgroup Identification Based Decision Tree Learning Framework [3.119681354260829]
We propose ZTree, a novel decision tree learning framework. It replaces CART's traditional purity-based splitting with statistically principled subgroup identification. ZTree consistently delivers strong performance, especially in low-data regimes.
arXiv Detail & Related papers (2025-09-16T05:25:16Z) - Learning Decision Trees as Amortized Structure Inference [59.65621207449269]
We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks.
arXiv Detail & Related papers (2025-03-10T07:05:07Z) - Probabilistic Tree-of-thought Reasoning for Answering Knowledge-intensive Complex Questions [93.40614719648386]
Large language models (LLMs) are capable of answering knowledge-intensive complex questions with chain-of-thought (CoT) reasoning.
Recent works turn to retrieving external knowledge to augment CoT reasoning.
We propose a novel approach: Probabilistic Tree-of-thought Reasoning (ProbTree).
arXiv Detail & Related papers (2023-11-23T12:52:37Z) - Tree Prompting: Efficient Task Adaptation without Fine-Tuning [112.71020326388029]
Tree Prompting builds a decision tree of prompts, linking multiple LM calls together to solve a task.
Experiments on classification datasets show that Tree Prompting improves accuracy over competing methods and is competitive with fine-tuning.
arXiv Detail & Related papers (2023-10-21T15:18:22Z) - Robustifying Algorithms of Learning Latent Trees with Vector Variables [92.18777020401484]
We present the sample complexities of Recursive Grouping (RG) and Chow-Liu Recursive Grouping (CLRG).
We robustify RG, CLRG, Neighbor Joining (NJ) and Spectral NJ (SNJ) by using the truncated inner product.
We derive the first known instance-dependent impossibility result for structure learning of latent trees.
arXiv Detail & Related papers (2021-06-02T01:37:52Z) - Growing Deep Forests Efficiently with Soft Routing and Learned Connectivity [79.83903179393164]
This paper further extends the deep forest idea in several important aspects.
We employ a probabilistic tree whose nodes make probabilistic routing decisions, a.k.a., soft routing, rather than hard binary decisions.
Experiments on the MNIST dataset demonstrate that our empowered deep forests achieve performance better than or comparable to [1], [3].
arXiv Detail & Related papers (2020-12-29T18:05:05Z)