Integrating Random Forests and Generalized Linear Models for Improved Accuracy and Interpretability
- URL: http://arxiv.org/abs/2307.01932v2
- Date: Fri, 23 May 2025 14:52:12 GMT
- Title: Integrating Random Forests and Generalized Linear Models for Improved Accuracy and Interpretability
- Authors: Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu
- Abstract summary: We use a framework called RF+ to combine the strengths of RFs and generalized linear models. We show that RF+ improves prediction accuracy over RFs and that MDI+ outperforms popular feature importance measures in identifying signal features.
- Score: 9.128252505139471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Random forests (RFs) are among the most popular supervised learning algorithms due to their nonlinear flexibility and ease-of-use. However, as black box models, they can only be interpreted via algorithmically-defined feature importance methods, such as Mean Decrease in Impurity (MDI), which have been observed to be highly unstable and have ambiguous scientific meaning. Furthermore, they can perform poorly in the presence of smooth or additive structure. To address this, we reinterpret decision trees and MDI as linear regression and $R^2$ values, respectively, with respect to engineered features associated with the tree's decision splits. This allows us to combine the respective strengths of RFs and generalized linear models in a framework called RF+, which also yields an improved feature importance method we call MDI+. Through extensive data-inspired simulations and real-world datasets, we show that RF+ improves prediction accuracy over RFs and that MDI+ outperforms popular feature importance measures in identifying signal features, often yielding more than a 10% improvement over its closest competitor. In case studies on drug response prediction and breast cancer subtyping, we further show that MDI+ extracts well-established genes with significantly greater stability compared to existing feature importance measures.
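To make the reinterpretation concrete, below is a minimal sketch of the tree-plus-GLM idea, assuming scikit-learn and one indicator feature per internal split node (a simplification of the paper's engineered features; the construction and names here are illustrative, not the authors' reference implementation):

```python
# Minimal sketch: represent a fitted decision tree by binary "split" features
# (did x pass through node t and go left?) and fit a regularized linear model
# on them. A per-feature R^2 decomposition of such a model is the MDI+ idea.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import RidgeCV

def split_features(tree, X):
    """One binary feature per internal node: x reaches the node and goes left."""
    t = tree.tree_
    paths = tree.decision_path(X).toarray()      # (n_samples, n_nodes) indicators
    internal = t.children_left != -1             # mask of internal (split) nodes
    went_left = np.zeros_like(paths)
    for node in np.where(internal)[0]:
        went_left[:, node] = paths[:, t.children_left[node]]
    return went_left[:, internal]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
Z = split_features(tree, X)
glm = RidgeCV().fit(Z, y)   # a GLM on tree-derived features instead of leaf means
print(glm.score(Z, y))      # R^2 of the tree viewed as a linear model
```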
Related papers
- Local MDI+: Local Feature Importances for Tree-Based Models [8.532396185972392]
Local MDI+ (LMDI+) is a novel extension of the MDI+ framework to the sample-specific setting. It produces similar instance-level feature importance rankings across multiple random forest fits. It also enables local interpretability use cases, including the identification of closer counterfactuals.
arXiv Detail & Related papers (2025-06-10T15:51:27Z) - RefiDiff: Refinement-Aware Diffusion for Efficient Missing Data Imputation [13.401822039640297]
Missing values in high-dimensional, mixed-type datasets pose significant challenges for data imputation. We propose an innovative framework, RefiDiff, combining local machine learning predictions with a novel Mamba-based denoising network. RefiDiff outperforms state-of-the-art (SOTA) methods across missing-value settings with a 4x faster training time than DDPM-based approaches.
arXiv Detail & Related papers (2025-05-20T14:51:07Z) - Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities.
Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality.
We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM).
arXiv Detail & Related papers (2025-03-24T08:46:52Z) - MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications.
Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics.
We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
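As a toy illustration of this idea, one label-free heuristic scores each model by its mean max-softmax confidence on unlabeled inputs and ranks models by that score (a sketch of one such uncertainty signal, not the paper's full protocol; all names below are illustrative):

```python
# Rank models on unlabeled data by mean max-softmax confidence.
import numpy as np

def confidence_score(logits: np.ndarray) -> float:
    """Mean max-softmax probability over an unlabeled evaluation set."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(probs.max(axis=1).mean())

rng = np.random.default_rng(0)
model_logits = {f"model_{i}": rng.normal(scale=s, size=(1000, 4))
                for i, s in enumerate([0.5, 1.0, 2.0])}
ranking = sorted(model_logits, key=lambda m: confidence_score(model_logits[m]),
                 reverse=True)
print(ranking)   # most confident model first; no labels required
```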
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - Enhancing Retail Sales Forecasting with Optimized Machine Learning Models [0.0]
In retail sales forecasting, accurately predicting future sales is crucial for inventory management and strategic planning. Recent advancements in machine learning (ML) provide more robust alternatives. This research leverages the power of ML, particularly Random Forest (RF), Gradient Boosting (GB), Support Vector Regression (SVR), and XGBoost.
arXiv Detail & Related papers (2024-10-17T17:11:33Z) - Is the MMI Criterion Necessary for Interpretability? Degenerating Non-causal Features to Plain Noise for Self-Rationalization [17.26418974819275]
This paper develops a new criterion that treats spurious features as plain noise.
Experiments show that our MRD criterion improves rationale quality (measured by the overlap with human-annotated rationales) by up to $10.4\%$ as compared to several recent competitive MMI variants.
arXiv Detail & Related papers (2024-10-08T13:04:02Z) - Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space [72.52365911990935]
We introduce Bellman Diffusion, a novel DGM framework that maintains linearity in MDPs through gradient and scalar field modeling.
Our results show that Bellman Diffusion achieves accurate field estimations and is a capable image generator, converging 1.5x faster than the traditional histogram-based baseline in distributional RL tasks.
arXiv Detail & Related papers (2024-10-02T17:53:23Z) - SMRD: SURE-based Robust MRI Reconstruction with Diffusion Models [76.43625653814911]
Diffusion models have gained popularity for accelerated MRI reconstruction due to their high sample quality.
They can effectively serve as rich data priors while incorporating the forward model flexibly at inference time.
We introduce SURE-based MRI Reconstruction with Diffusion models (SMRD) to enhance robustness during testing.
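For reference, Stein's Unbiased Risk Estimate in its standard Gaussian-denoising form (the paper adapts the idea to the MRI reconstruction setting) estimates reconstruction risk without ground truth: for $y = x + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ and a reconstruction $f(y)$,
$$\mathrm{SURE}(f) = -n\sigma^2 + \lVert f(y) - y \rVert_2^2 + 2\sigma^2\, \nabla_y \cdot f(y),$$
which is an unbiased estimate of $\mathbb{E}\,\lVert f(y) - x \rVert_2^2$ and can therefore drive tuning at test time when no clean $x$ is available.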
arXiv Detail & Related papers (2023-10-03T05:05:35Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
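The resulting closed-form objective is the DPO loss: for a prompt $x$ with preferred and dispreferred completions $(y_w, y_l)$,
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$
a binary classification loss on implicit rewards, with no separate reward-model fitting or RL loop.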
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptive (TTA) methods are proposed to address this issue.
In this work, we adopt a Non-Parametric Classifier to perform test-time Adaptation (AdaNPC).
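A toy sketch of the non-parametric test-time adaptation idea (ours, for illustration; AdaNPC's actual memory construction and voting scheme differ in detail): classify by k-nearest neighbours over a feature memory, then grow the memory with pseudo-labeled test samples.

```python
# k-NN over a feature memory that adapts (grows) at test time.
import numpy as np

class NonParametricTTA:
    def __init__(self, feats, labels, k=5):
        self.feats, self.labels, self.k = feats, labels, k

    def predict(self, z, adapt=True):
        dists = np.linalg.norm(self.feats - z, axis=1)
        votes = self.labels[np.argsort(dists)[:self.k]]
        pred = int(np.bincount(votes).argmax())
        if adapt:  # test-time adaptation: memorize the pseudo-labeled sample
            self.feats = np.vstack([self.feats, z])
            self.labels = np.append(self.labels, pred)
        return pred

rng = np.random.default_rng(0)
clf = NonParametricTTA(rng.normal(size=(100, 8)), rng.integers(0, 3, 100))
print(clf.predict(rng.normal(size=8)))   # the memory now holds 101 samples
```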
arXiv Detail & Related papers (2023-04-25T04:23:13Z) - Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z) - uGLAD: Sparse graph recovery by optimizing deep unrolled networks [11.48281545083889]
We present a novel technique to perform sparse graph recovery by optimizing deep unrolled networks.
Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting.
We evaluate model results on synthetic Gaussian data, non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.
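For context, the program being unrolled is the graphical lasso: given an empirical covariance $S$, recover a sparse precision matrix
$$\hat{\Theta} = \arg\min_{\Theta \succ 0}\; \operatorname{tr}(S\Theta) - \log\det\Theta + \rho\, \lVert \Theta \rVert_{1,\mathrm{off}},$$
whose nonzero pattern is the recovered graph; in the unsupervised setting this likelihood-based objective itself can serve as the training loss for the unrolled network (our summary, stated as an assumption about the setup).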
arXiv Detail & Related papers (2022-05-23T20:20:27Z) - A Hybrid Framework for Sequential Data Prediction with End-to-End Optimization [0.0]
We investigate nonlinear prediction in an online setting and introduce a hybrid model that effectively mitigates the need for hand-designed features and manual model selection.
We employ a recurrent neural network (LSTM) for adaptive feature extraction from sequential data and a gradient boosting machinery (soft GBDT) for effective supervised regression.
We demonstrate the learning behavior of our algorithm on synthetic data and show significant performance improvements over conventional methods on various real-life datasets.
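A minimal differentiable sketch of this hybrid, assuming PyTorch, with a single soft decision tree (sigmoid routing) standing in for the paper's soft GBDT ensemble; all class and parameter names are illustrative:

```python
# End-to-end hybrid: LSTM feature extractor -> soft (differentiable) tree regressor.
import torch
import torch.nn as nn

class SoftTree(nn.Module):
    def __init__(self, in_dim, depth=3):
        super().__init__()
        self.depth = depth
        self.gates = nn.Linear(in_dim, 2**depth - 1)        # one gate per split
        self.leaf_values = nn.Parameter(torch.zeros(2**depth))

    def forward(self, h):
        gate = torch.sigmoid(self.gates(h))                 # P(go left) per split
        probs, idx = torch.ones(h.size(0), 1, device=h.device), 0
        for _ in range(self.depth):                         # expand level by level
            g = gate[:, idx:idx + probs.size(1)]
            idx += probs.size(1)
            probs = torch.stack([probs * g, probs * (1 - g)], dim=2).flatten(1)
        return probs @ self.leaf_values                     # soft mixture of leaves

class HybridForecaster(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.tree = SoftTree(hidden)

    def forward(self, x):                                   # x: (batch, time, feat)
        _, (h, _) = self.lstm(x)
        return self.tree(h[-1])                             # regress on last state

model = HybridForecaster(n_features=4)
loss = nn.functional.mse_loss(model(torch.randn(8, 20, 4)), torch.randn(8))
loss.backward()    # gradients flow through both the tree and the LSTM
```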
arXiv Detail & Related papers (2022-03-25T17:13:08Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
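Concretely, the model posits
$$\Pr(y_i = 1 \mid \mathbf{x}_i) = \Phi_p(\mathbf{x}_i^\top \beta), \qquad \Phi_p(t) = \int_{-\infty}^{t} \frac{p^{1-1/p}}{2\,\Gamma(1/p)}\, e^{-|s|^p/p}\, ds,$$
the CDF of the $p$-generalized normal distribution; $p = 2$ recovers standard probit regression, and smaller $p$ yields heavier-tailed links.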
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - Improving Generalization via Uncertainty Driven Perturbations [107.45752065285821]
We consider uncertainty-driven perturbations of the training data points.
Unlike loss-driven perturbations, uncertainty-guided perturbations do not cross the decision boundary.
We show that UDP is guaranteed to achieve the maximum-margin decision boundary on linear models.
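A schematic of the mechanism, assuming predictive entropy as the uncertainty measure (a sketch of the stated contrast with loss-driven perturbations, not the paper's exact recipe; `udp_perturb` is a hypothetical helper):

```python
# Perturb inputs along the gradient of model *uncertainty* (entropy), not loss.
import torch
import torch.nn.functional as F

def udp_perturb(model, x, eps=0.1):
    x = x.clone().requires_grad_(True)
    probs = F.softmax(model(x), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    (grad,) = torch.autograd.grad(entropy, x)
    return (x + eps * grad.sign()).detach()   # FGSM-style step on uncertainty

model = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 3))
x_pert = udp_perturb(model, torch.randn(32, 2))   # augmented training points
```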
arXiv Detail & Related papers (2022-02-11T16:22:08Z) - From global to local MDI variable importances for random forests and
when they are Shapley values [9.99125500568217]
We first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under some conditions.
We derive a local MDI importance measure of variable relevance, which has a very natural connection with the global MDI measure and can be related to a new notion of local feature relevance.
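One natural way to write this connection (notation ours): in a single tree, the global MDI of $X_j$ is $\mathrm{Imp}(X_j) = \sum_{t:\, v(t) = j} p(t)\, \Delta i(t)$ over nodes $t$ splitting on $X_j$, and since $p(t) = N_t/N$ simply counts the samples reaching $t$, it decomposes as an average of per-sample scores:
$$\mathrm{Imp}(X_j; \mathbf{x}) = \sum_{t:\, v(t) = j,\ \mathbf{x} \in t} \Delta i(t), \qquad \mathrm{Imp}(X_j) = \frac{1}{N} \sum_{\mathbf{x}} \mathrm{Imp}(X_j; \mathbf{x}).$$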
arXiv Detail & Related papers (2021-11-03T13:38:41Z) - Overcoming Catastrophic Forgetting with Gaussian Mixture Replay [79.0660895390689]
We present Gaussian Mixture Replay (GMR), a rehearsal-based approach for continual learning (CL) built on Gaussian Mixture Models (GMMs).
We mitigate catastrophic forgetting (CF) by generating samples from previous tasks and merging them with current training data.
We evaluate GMR on multiple image datasets, which are divided into class-disjoint sub-tasks.
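A minimal sketch of the replay mechanism, assuming scikit-learn (component count and mixing ratio are illustrative choices, not the paper's settings):

```python
# Gaussian mixture replay: fit a GMM on past data, rehearse by sampling from it.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_task1 = rng.normal(loc=0.0, size=(1000, 16))   # "previous task" features
X_task2 = rng.normal(loc=3.0, size=(1000, 16))   # "current task" features

gmm = GaussianMixture(n_components=10, random_state=0).fit(X_task1)
X_replay, _ = gmm.sample(500)                    # pseudo-samples of past data
X_train = np.vstack([X_task2, X_replay])         # current data + rehearsal
print(X_train.shape)                             # (1500, 16)
```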
arXiv Detail & Related papers (2021-04-19T11:41:34Z) - HiPaR: Hierarchical Pattern-aided Regression [71.22664057305572]
HiPaR mines hybrid rules of the form $p \Rightarrow y = f(X)$ where $p$ is the characterization of a data region and $f(X)$ is a linear regression model on a variable of interest $y$.
HiPaR relies on pattern mining techniques to identify regions of the data where the target variable can be accurately explained via local linear models.
As our experiments show, HiPaR mines fewer rules than existing pattern-based regression methods while still attaining state-of-the-art prediction performance.
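An illustrative (hypothetical) rule of this form: $(\text{store} = \text{urban}) \wedge (\text{season} = \text{winter}) \Rightarrow \text{sales} = 3.2 + 0.8 \cdot \text{promo\_budget}$, i.e., a mined pattern selecting a data region, paired with a local linear model that holds on that region.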
arXiv Detail & Related papers (2021-02-24T15:53:17Z) - Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
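For reference, the Cauchy-Schwarz divergence is
$$D_{\mathrm{CS}}(p \,\|\, q) = -\log \frac{\int p(x)\, q(x)\, dx}{\sqrt{\int p(x)^2\, dx \int q(x)^2\, dx}},$$
and each integral is analytic for GMMs because Gaussian overlaps reduce to Gaussian evaluations, $\int \mathcal{N}(x; \mu_1, \Sigma_1)\, \mathcal{N}(x; \mu_2, \Sigma_2)\, dx = \mathcal{N}(\mu_1; \mu_2, \Sigma_1 + \Sigma_2)$, so the constrained objective has a closed form.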
arXiv Detail & Related papers (2021-01-06T17:36:26Z) - Toward a Generalization Metric for Deep Generative Models [18.941388632914666]
Generalization capacity of Deep Generative Models (DGMs) is difficult to measure.
We introduce a framework for comparing the robustness of evaluation metrics.
We develop an efficient method for estimating the complexity of Generative Latent Variable Models (GLVMs).
arXiv Detail & Related papers (2020-11-02T05:32:07Z) - Efficient MDI Adaptation for n-gram Language Models [25.67864542036985]
This paper presents an efficient algorithm for n-gram language model adaptation under the minimum discrimination information principle.
By taking advantage of the backoff structure of the n-gram model and the idea of hierarchical training, we show that MDI adaptation can be computed with complexity linear in the size of the inputs at each iteration.
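In its classical unigram-constrained form (notation ours; the paper's exact constraints may differ), MDI adaptation rescales a background model $p_B$ toward in-domain unigram marginals,
$$p_A(w \mid h) = \frac{\alpha(w)\, p_B(w \mid h)}{Z(h)}, \qquad \alpha(w) = \frac{p_A(w)}{p_B(w)}, \qquad Z(h) = \sum_{w'} \alpha(w')\, p_B(w' \mid h),$$
and the normalizers $Z(h)$ are the expensive part; exploiting the backoff recursion over histories is what brings their computation down to linear time.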
arXiv Detail & Related papers (2020-08-05T22:21:03Z) - Learning to Match Distributions for Domain Adaptation [116.14838935146004]
This paper proposes Learning to Match (L2M) to automatically learn the cross-domain distribution matching.
L2M reduces the inductive bias by using a meta-network to learn the distribution matching loss in a data-driven way.
Experiments on public datasets substantiate the superiority of L2M over SOTA methods.
arXiv Detail & Related papers (2020-07-17T03:26:13Z) - Joints in Random Forests [13.096855747795303]
Decision Trees (DTs) and Random Forests (RFs) are powerful discriminative learners and tools of central importance to the everyday machine learning practitioner and data scientist.
We show that DTs and RFs can naturally be interpreted as generative models, by drawing a connection to Probabilistic Circuits.
This reinterpretation equips them with a full joint distribution over the feature space and leads to Generative Decision Trees (GeDTs) and Generative Forests (GeFs).
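Sketched in our notation: the leaves $\ell$ of a tree partition the feature space, and attaching a distribution to each leaf turns the tree into a mixture
$$p(\mathbf{x}, y) = \sum_{\ell} P(\ell)\, p(\mathbf{x} \mid \ell)\, p(y \mid \ell),$$
which is precisely the structure of a smooth, decomposable probabilistic circuit; marginal queries such as prediction under missing features then become tractable.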
arXiv Detail & Related papers (2020-06-25T14:17:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.