Explaining Neural Matrix Factorization with Gradient Rollback
- URL: http://arxiv.org/abs/2010.05516v4
- Date: Tue, 15 Dec 2020 07:01:30 GMT
- Title: Explaining Neural Matrix Factorization with Gradient Rollback
- Authors: Carolin Lawrence, Timo Sztyler, Mathias Niepert
- Abstract summary: Gradient rollback is a general approach for influence estimation.
We show that gradient rollback is highly efficient at both training and test time.
Gradient rollback provides faithful explanations for knowledge base completion and recommender datasets.
- Score: 22.33402175974514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explaining the predictions of neural black-box models is an important
problem, especially when such models are used in applications where user trust
is crucial. Estimating the influence of training examples on a learned neural
model's behavior allows us to identify training examples most responsible for a
given prediction and, therefore, to faithfully explain the output of a
black-box model. The most generally applicable existing method is based on
influence functions, which scale poorly for larger sample sizes and models.
We propose gradient rollback, a general approach for influence estimation,
applicable to neural models where each parameter update step during gradient
descent touches a smaller number of parameters, even if the overall number of
parameters is large. Neural matrix factorization models trained with gradient
descent are part of this model class. These models are popular and have found a
wide range of applications in industry. Especially knowledge graph embedding
methods, which belong to this class, are used extensively. We show that
gradient rollback is highly efficient at both training and test time. Moreover,
we show theoretically that the difference between gradient rollback's influence
approximation and the true influence on a model's behavior is smaller than
known bounds on the stability of stochastic gradient descent. This establishes
that gradient rollback is robustly estimating example influence. We also
conduct experiments which show that gradient rollback provides faithful
explanations for knowledge base completion and recommender datasets.
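The core idea can be sketched on a toy matrix factorization model (our own minimal illustration, not the authors' implementation): during SGD each example touches only a few parameters, so we can log the updates it causes and later "roll them back" to estimate its influence on a prediction without retraining.

```python
import numpy as np

# Toy matrix-factorization trainer illustrating the gradient-rollback idea.
# All sizes, data, and hyperparameters below are made up for illustration.
rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 6, 3
U = rng.normal(scale=0.1, size=(n_users, dim))
V = rng.normal(scale=0.1, size=(n_items, dim))
data = [(0, 1, 1.0), (0, 2, 0.0), (1, 1, 1.0), (2, 3, 1.0)]
lr = 0.1
# log[k] accumulates the parameter updates caused by training example k
log = {k: (np.zeros(dim), np.zeros(dim)) for k in range(len(data))}

for _ in range(50):  # SGD epochs on squared error
    for k, (u, i, y) in enumerate(data):
        err = U[u] @ V[i] - y
        du, dv = lr * err * V[i], lr * err * U[u]
        U[u] -= du
        V[i] -= dv
        log[k] = (log[k][0] + du, log[k][1] + dv)

def score(U, V, u, i):
    return U[u] @ V[i]

def rollback_score(k, u, i):
    """Score of (u, i) after undoing example k's recorded updates."""
    ku, ki = data[k][0], data[k][1]
    U2, V2 = U.copy(), V.copy()
    U2[ku] += log[k][0]
    V2[ki] += log[k][1]
    return score(U2, V2, u, i)

# Influence of each example on the prediction (0, 1): score drop when rolled back.
influences = [score(U, V, 0, 1) - rollback_score(k, 0, 1) for k in range(len(data))]
```

Here example 0, the observed (0, 1) interaction itself, should receive the largest influence, while example 3 shares no parameters with the prediction and has influence exactly zero.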
Related papers
- Precise asymptotic analysis of Sobolev training for random feature models [0.0]
We study the impact of Sobolev training -- regression with both function and gradient data -- on the generalization error of predictive models in high dimensions.
Our results identify settings where models perform optimally by interpolating noisy function and gradient data.
arXiv Detail & Related papers (2025-11-04T22:49:33Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
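The difference-in-differences design can be shown with a minimal numeric sketch (all numbers below are made up): the memorisation effect is the change in an instance's log-probability for models trained on it, minus the change for models never trained on it, which nets out improvement due to general learning.

```python
# Hypothetical log-probabilities of one instance, before and after training:
treated_before, treated_after = -4.0, -1.0   # models trained on the instance
control_before, control_after = -4.0, -3.5   # models never trained on it

# Difference-in-differences: treated change minus control change.
did = (treated_after - treated_before) - (control_after - control_before)
# did = 3.0 - 0.5 = 2.5, the estimated causal memorisation effect
```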
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models [36.05242956018461]
In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection.
We first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets.
We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models.
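A minimal sketch of the underlying intuition (our own toy construction, not the paper's exact procedure): compute per-sample loss gradients and flag samples whose gradient deviates most from the mean, since mislabeled points tend to produce outlying gradients.

```python
import numpy as np

# Made-up 1-D regression data: y = 2x, except the last label is corrupted.
X = np.array([[1.0], [2.0], [3.0], [4.0], [2.5]])
y = np.array([2.0, 4.0, 6.0, 8.0, -5.0])
w = 2.0  # current model weight (already fits the clean samples)

# Squared-loss gradient per sample: d/dw (w*x - y)^2 = 2 * (w*x - y) * x
grads = 2 * (w * X[:, 0] - y) * X[:, 0]
dist = np.abs(grads - grads.mean())
outlier = int(np.argmax(dist))  # index of the most outlying gradient
```

The clean samples have zero gradient, so the corrupted sample (index 4) stands out immediately.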
arXiv Detail & Related papers (2024-05-06T21:34:46Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Analysis of Interpolating Regression Models and the Double Descent Phenomenon [3.883460584034765]
It is commonly assumed that models which interpolate noisy training data generalize poorly.
The best models obtained are overparametrized and the testing error exhibits the double descent behavior as the model order increases.
We derive a result based on the behavior of the smallest singular value of the regression matrix that explains the peak location and the double descent shape of the testing error as a function of model order.
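The role of the smallest singular value can be illustrated numerically (our own toy demonstration, not the paper's derivation): for a random n x p design matrix, the smallest nonzero singular value dips near the interpolation threshold p = n, inflating the pseudoinverse norm and hence the test-error peak.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of samples, fixed

def smin(p):
    """Smallest nonzero singular value of a random n x p Gaussian matrix."""
    s = np.linalg.svd(rng.normal(size=(n, p)), compute_uv=False)
    return s[min(n, p) - 1]

# Underparameterized, critical (p = n), and overparameterized regimes.
under, critical, over = smin(20), smin(100), smin(180)
# `critical` is far smaller than the other two: the min-norm solution is
# most ill-conditioned exactly at the interpolation threshold.
```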
arXiv Detail & Related papers (2023-04-17T09:44:33Z)
- First is Better Than Last for Language Data Influence [44.907420330002815]
We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer.
We also show that TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input.
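The TracIn scheme that TracIn-WE builds on can be sketched in a few lines (all gradients below are made up for illustration): the influence of a training point on a test point is approximated by the dot product of their loss gradients, summed over checkpoints and scaled by the learning rate.

```python
import numpy as np

lr = 0.1
# Made-up loss gradients at two checkpoints (rows = checkpoints).
g_train = np.array([[1.0, 0.0], [0.5, 0.5]])  # gradients for the training point
g_test = np.array([[2.0, 1.0], [1.0, 1.0]])   # gradients for the test point

# TracIn influence estimate: lr-weighted gradient dot products per checkpoint.
influence = sum(lr * gt @ gs for gt, gs in zip(g_train, g_test))
# 0.1 * (1*2 + 0*1) + 0.1 * (0.5*1 + 0.5*1) = 0.2 + 0.1 = 0.3
```

TracIn-WE applies this at the word-embedding (first) layer rather than the last layer, which is what enables the word-level scores mentioned above.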
arXiv Detail & Related papers (2022-02-24T00:48:29Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
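The reparameterisation itself is compact enough to sketch (a minimal illustration with made-up values): the effective weight is w = v * |v|^(alpha - 1), so by the chain rule the gradient with respect to v is scaled by alpha * |v|^(alpha - 1), shrinking updates to already-small parameters and pushing mass toward zero.

```python
import numpy as np

alpha = 2.0                                 # alpha > 1 induces sparsity
v = np.array([0.01, 0.5, -1.0])             # underlying trainable parameters
w = v * np.abs(v) ** (alpha - 1)            # effective weights used in the forward pass

dL_dw = np.ones_like(v)                     # pretend upstream gradient
# Chain rule: dw/dv = alpha * |v|**(alpha - 1), so small |v| gets tiny updates.
dL_dv = dL_dw * alpha * np.abs(v) ** (alpha - 1)
```

The near-zero parameter (0.01) receives a gradient two orders of magnitude smaller than the largest one, which is why the trained distribution piles up density at zero and prunes safely.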
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overparameterization not only benefits training large models, but also assists - perhaps counterintuitively - in building lightweight models.
This paper sheds light on these empirical findings by providing a theoretical, high-dimensional characterization of model pruning.
We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
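Why gradient estimation is needed at all can be shown generically (our own illustration, not the paper's estimator): the derivative of a hard binary activation sign(x) is zero almost everywhere, so plain backprop yields no signal, and a surrogate such as the classic straight-through estimator must stand in for it.

```python
import numpy as np

x = np.array([-1.5, 0.3, 0.8])              # pre-activations (made up)
b = np.sign(x)                               # binary activation: exact gradient is 0 a.e.

upstream = np.array([1.0, 1.0, 1.0])         # pretend gradient from the layer above
# Straight-through estimator: pass the upstream gradient through where |x| <= 1,
# block it elsewhere (a common surrogate for the zero true gradient).
st_grad = upstream * (np.abs(x) <= 1.0)
```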
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.