Reliable and scalable variable importance estimation via warm-start and early stopping
- URL: http://arxiv.org/abs/2412.01120v1
- Date: Mon, 02 Dec 2024 04:45:10 GMT
- Title: Reliable and scalable variable importance estimation via warm-start and early stopping
- Authors: Zexuan Sun, Garvesh Raskutti
- Abstract summary: We develop a scalable method to estimate variable importance for any algorithm that can be expressed as an iterative kernel update equation.
We provide theoretical guarantees by using the theory for early stopping of kernel-based methods for neural networks with sufficiently large (but not necessarily infinite) width and gradient-boosting decision trees.
- Score: 8.870380386952993
- Abstract: As opaque black-box predictive models become more prevalent, there is great interest in developing interpretations for them. Variable importance and Shapley values are interpretability measures that apply to any predictive model and assess how much a variable or set of variables improves prediction performance. When the number of variables is large, estimating variable importance presents a significant computational challenge because re-training neural networks or other black-box algorithms requires significant additional computation. In this paper, we address this challenge for algorithms using gradient descent and gradient boosting (e.g., neural networks, gradient-boosted decision trees). By combining early stopping of gradient-based methods with warm-start using the dropout method, we develop a scalable method to estimate variable importance for any algorithm that can be expressed as an iterative kernel update equation. Importantly, we provide theoretical guarantees by using the theory for early stopping of kernel-based methods for neural networks with sufficiently large (but not necessarily infinite) width and gradient-boosting decision trees that use symmetric trees as a weak learner. We also demonstrate the efficacy of our methods through simulations and a real data example, which illustrate the computational benefit of early stopping rather than fully re-training the model, as well as the increased accuracy of our approach.
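The warm-start plus early-stopping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain linear model trained by gradient descent on squared loss, treats "dropout" as replacing a feature column with its training mean, and defines importance as the increase in held-out loss of the reduced model over the full model. All function names (`mse`, `gd_early_stop`, `drop_feature`, `variable_importance`) and the synthetic data are hypothetical and chosen for illustration only.

```python
import numpy as np

def mse(X, y, w):
    """Mean squared error of the linear predictor X @ w."""
    return float(np.mean((X @ w - y) ** 2))

def gd_early_stop(X, y, X_val, y_val, w_init, lr=0.05, max_iter=5000, patience=10):
    """Gradient descent from w_init, stopped when validation loss stalls."""
    w = w_init.copy()
    best_w, best_val, stale = w.copy(), mse(X_val, y_val, w), 0
    for _ in range(max_iter):
        grad = 2.0 * (X.T @ (X @ w - y)) / len(y)
        w = w - lr * grad
        val = mse(X_val, y_val, w)
        if val < best_val - 1e-10:
            best_w, best_val, stale = w.copy(), val, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_w

def drop_feature(X, j, fill):
    """Return a copy of X with column j replaced by a constant (assumed dropout)."""
    Xd = X.copy()
    Xd[:, j] = fill
    return Xd

def variable_importance(train, val, test):
    """Importance of feature j = test loss of reduced model minus full model."""
    (X, y), (Xv, yv), (Xt, yt) = train, val, test
    p = X.shape[1]
    w_full = gd_early_stop(X, y, Xv, yv, np.zeros(p))
    full_loss = mse(Xt, yt, w_full)
    means = X.mean(axis=0)
    vi = np.empty(p)
    for j in range(p):
        Xd, Xvd, Xtd = (drop_feature(A, j, means[j]) for A in (X, Xv, Xt))
        # Warm start: begin from the full-model weights instead of retraining.
        w_red = gd_early_stop(Xd, y, Xvd, yv, w_full)
        vi[j] = mse(Xtd, yt, w_red) - full_loss
    return vi

# Synthetic data (an assumption): x1 is correlated with x0 but has no direct effect.
rng = np.random.default_rng(0)
n = 1200
x0, z, x2 = rng.normal(size=(3, n))
X_all = np.column_stack([x0, 0.6 * x0 + 0.8 * z, x2])
y_all = 3.0 * x0 + 1.0 * x2 + 0.5 * rng.normal(size=n)
splits = [(X_all[a:b], y_all[a:b]) for a, b in ((0, 600), (600, 900), (900, 1200))]
vi = variable_importance(*splits)
print(vi)
```

In this sketch each reduced model starts from the full-model weights, so it typically needs only a few gradient steps before the validation loss stops improving, rather than a full retraining run per variable.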
Related papers
- Predicting Probabilities of Error to Combine Quantization and Early Exiting: QuEE [68.6018458996143]
We propose a more general dynamic network, QuEE, that combines quantization and early exiting.
Our algorithm can be seen as a form of soft early exiting or input-dependent compression.
The crucial factor of our approach is accurate prediction of the potential accuracy improvement achievable through further computation.
arXiv Detail & Related papers (2024-06-20T15:25:13Z)
- Bayesian Deep Learning for Remaining Useful Life Estimation via Stein Variational Gradient Descent [14.784809634505903]
We show that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform with respect to convergence speed and predictive performance.
We propose a method to enhance performance based on the uncertainty information provided by the Bayesian models.
arXiv Detail & Related papers (2024-02-02T02:21:06Z)
- Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels [57.46832672991433]
We propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS).
We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noise.
We develop an expectation-propagation expectation-maximization algorithm for efficient posterior inference and function estimation.
arXiv Detail & Related papers (2023-10-09T03:55:09Z)
- Generalizing Backpropagation for Gradient-Based Interpretability [103.2998254573497]
We show that the gradient of a model is a special case of a more general formulation using semirings.
This observation allows us to generalize the backpropagation algorithm to efficiently compute other interpretable statistics.
arXiv Detail & Related papers (2023-07-06T15:19:53Z)
- Lazy Estimation of Variable Importance for Large Neural Networks [22.95405462638975]
We propose a fast and flexible method for approximating the reduced model with important inferential guarantees.
We demonstrate our method is fast and accurate under several data-generating regimes, and we demonstrate its real-world applicability on a seasonal climate forecasting example.
arXiv Detail & Related papers (2022-07-19T06:28:17Z)
- Robust Learning of Parsimonious Deep Neural Networks [0.0]
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network.
We derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection.
We evaluate the proposed algorithm on the MNIST data set and commonly used fully connected and convolutional LeNet architectures.
arXiv Detail & Related papers (2022-05-10T03:38:55Z)
- Scalable computation of prediction intervals for neural networks via matrix sketching [79.44177623781043]
Existing algorithms for uncertainty estimation require modifying the model architecture and training procedure.
This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals.
arXiv Detail & Related papers (2022-05-06T13:18:31Z)
- Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
- Uncertainty in Gradient Boosting via Ensembles [37.808845398471874]
Ensembles of gradient boosting models successfully detect anomalous inputs while having limited ability to improve the predicted total uncertainty.
We propose a concept of a virtual ensemble to get the benefits of an ensemble via only one gradient boosting model, which significantly reduces complexity.
arXiv Detail & Related papers (2020-06-18T14:11:27Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)