Related papers: Handling Missing Data in Decision Trees: A Probabilistic Approach

URL: http://arxiv.org/abs/2006.16341v1
Date: Mon, 29 Jun 2020 19:54:54 GMT
Title: Handling Missing Data in Decision Trees: A Probabilistic Approach
Authors: Pasha Khosravi, Antonio Vergari, YooJung Choi, Yitao Liang, Guy Van den Broeck
Abstract summary: We tackle the problem of handling missing data in decision trees by taking a probabilistic approach. We use tractable density estimators to compute the "expected prediction" of our models. At learning time, we fine-tune parameters of already learned trees by minimizing their "expected prediction loss".
Score: 41.259097100704324
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Decision trees are a popular family of models due to their attractive properties such as interpretability and ability to handle heterogeneous data. Concurrently, missing data is a prevalent occurrence that hinders performance of machine learning models. As such, handling missing data in decision trees is a well-studied problem. In this paper, we tackle this problem by taking a probabilistic approach. At deployment time, we use tractable density estimators to compute the "expected prediction" of our models. At learning time, we fine-tune parameters of already learned trees by minimizing their "expected prediction loss" w.r.t. our density estimators. We provide brief experiments showcasing the effectiveness of our methods compared to a few baselines.
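Illustration: the "expected prediction" idea in the abstract can be sketched on a toy example. The sketch below is not the authors' implementation. It assumes binary features and a fully factorized (independent Bernoulli) distribution in place of the tractable density estimators the paper refers to. All names (Leaf, Split, expected_prediction, p_one) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    value: float  # prediction stored at this leaf


@dataclass
class Split:
    feature: str
    left: "Node"   # branch taken when feature == 0
    right: "Node"  # branch taken when feature == 1


Node = Union[Leaf, Split]


def expected_prediction(node: Node, observed: Dict[str, int],
                        p_one: Dict[str, float]) -> float:
    """Average the tree's output over the missing binary features.

    ASSUMPTION: features are independent, with P(feature = 1) given by p_one.
    """
    if isinstance(node, Leaf):
        return node.value
    if node.feature in observed:
        # Feature is known: follow the usual decision path.
        child = node.right if observed[node.feature] == 1 else node.left
        return expected_prediction(child, observed, p_one)
    # Feature is missing: weight both branches by its marginal probability,
    # fixing its value below so repeated tests on a path stay consistent.
    p = p_one[node.feature]
    left = expected_prediction(node.left, {**observed, node.feature: 0}, p_one)
    right = expected_prediction(node.right, {**observed, node.feature: 1}, p_one)
    return (1 - p) * left + p * right


# Toy tree: test "a" first, then "b" on the a == 1 branch.
tree = Split("a", Leaf(0.0), Split("b", Leaf(1.0), Leaf(3.0)))
p_one = {"a": 0.5, "b": 0.25}

all_missing = expected_prediction(tree, {}, p_one)
only_b_missing = expected_prediction(tree, {"a": 1}, p_one)
fully_observed = expected_prediction(tree, {"a": 1, "b": 1}, p_one)
```

When every feature is observed, the function reduces to the ordinary tree prediction. When features are missing, it returns the probability-weighted average over the leaves that remain reachable.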

Related papers

Collaborative Prediction: To Join or To Disjoin Datasets [5.9697789282446605]
We study the problem of developing practical algorithms that select appropriate datasets to minimize population loss. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability.
arXiv Detail & Related papers (2025-06-12T20:25:07Z)
Learning Decision Trees as Amortized Structure Inference [59.65621207449269]
We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks.
arXiv Detail & Related papers (2025-03-10T07:05:07Z)
Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased [0.0]
Imbalanced binary classification problems arise in many fields of study. It is common to subsample the majority class to create a (more) balanced dataset for model training. This biases the model's predictions because the model learns from a dataset that does not follow the same data generating process as new data.
arXiv Detail & Related papers (2024-12-17T19:38:29Z)
Decision Trees for Interpretable Clusters in Mixture Models and Deep Representations [5.65604054654671]
We introduce the notion of an explainability-to-noise ratio for mixture models. We propose an algorithm that takes as input a mixture model and constructs a suitable tree in data-independent time. We prove upper and lower bounds on the error rate of the resulting decision tree.
arXiv Detail & Related papers (2024-11-03T14:00:20Z)
Estimating Causal Effects from Learned Causal Networks [56.14597641617531]
We propose an alternative paradigm for answering causal-effect queries over discrete observable variables. We learn the causal Bayesian network and its confounding latent variables directly from the observational data. We show that this model completion learning approach can be more effective than estimand approaches.
arXiv Detail & Related papers (2024-08-26T08:39:09Z)
Treeffuser: Probabilistic Predictions via Conditional Diffusions with Gradient-Boosted Trees [39.9546129327526]
Treeffuser is an easy-to-use method for probabilistic prediction on tabular data. Treeffuser learns well-calibrated predictive distributions and can handle a wide range of regression tasks. We demonstrate its versatility with an application to inventory allocation under uncertainty using sales data from Walmart.
arXiv Detail & Related papers (2024-06-11T18:59:24Z)
Prediction Algorithms Achieving Bayesian Decision Theoretical Optimality Based on Decision Trees as Data Observation Processes [1.2774526936067927]
This paper uses trees to represent data observation processes behind given data. We derive the statistically optimal prediction, which is robust against overfitting. We solve this by a Markov chain Monte Carlo method, whose step size is adaptively tuned according to a posterior distribution for the trees.
arXiv Detail & Related papers (2023-06-12T12:14:57Z)
Uncertainty estimation of pedestrian future trajectory using Bayesian approximation [137.00426219455116]
Under dynamic traffic scenarios, planning based on deterministic predictions is not trustworthy. The authors propose to quantify uncertainty during forecasting using Bayesian approximation, which captures uncertainty that deterministic approaches miss. The effect of dropout weights and long-term prediction on future state uncertainty has been studied.
arXiv Detail & Related papers (2022-05-04T04:23:38Z)
Distributionally Robust Semi-Supervised Learning Over Graphs [68.29280230284712]
Semi-supervised learning (SSL) over graph-structured data emerges in many network science applications. To efficiently manage learning over graphs, variants of graph neural networks (GNNs) have been developed recently. Despite their success in practice, most existing methods are unable to handle graphs with uncertain nodal attributes. Challenges also arise due to distributional uncertainties associated with data acquired by noisy measurements. A distributionally robust learning framework is developed, where the objective is to train models that exhibit quantifiable robustness against perturbations.
arXiv Detail & Related papers (2021-10-20T14:23:54Z)
A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds [9.546094657606178]
We study the generalization performance of decision trees with respect to different generative regression models. This allows us to elicit their inductive bias, that is, the assumptions the algorithms make (or do not make) to generalize to new data. We prove a sharp squared error generalization lower bound for a large class of decision tree algorithms fitted to sparse additive models.
arXiv Detail & Related papers (2021-10-18T21:22:40Z)
Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance guided stochastic gradient descent (IGSGD) method to train models that perform inference directly from inputs containing missing values, without imputation. We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation. Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression [51.770998056563094]
Probabilistic Gradient Boosting Machines (PGBM) is a method to create probabilistic predictions with a single ensemble of decision trees. We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-03T08:32:13Z)
Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem [1.5749416770494704]
A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data. Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest.
arXiv Detail & Related papers (2020-12-19T21:48:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.