Positive-Congruent Training: Towards Regression-Free Model Updates
- URL: http://arxiv.org/abs/2011.09161v3
- Date: Mon, 17 May 2021 20:10:51 GMT
- Title: Positive-Congruent Training: Towards Regression-Free Model Updates
- Authors: Sijie Yan, Yuanjun Xiong, Kaustav Kundu, Shuo Yang, Siqi Deng, Meng
Wang, Wei Xia, Stefano Soatto
- Abstract summary: In image classification, sample-wise inconsistencies appear as "negative flips": a new model incorrectly predicts the output for a test sample that was correctly classified by the old (reference) model.
We propose Focal Distillation, a simple approach to positive-congruent (PC) training that enforces congruence with the reference model by giving more weight to samples the reference model classified correctly.
- Score: 87.25247195148187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reducing inconsistencies in the behavior of different versions of an AI
system can be as important in practice as reducing its overall error. In image
classification, sample-wise inconsistencies appear as "negative flips": A new
model incorrectly predicts the output for a test sample that was correctly
classified by the old (reference) model. Positive-congruent (PC) training aims
at reducing error rate while at the same time reducing negative flips, thus
maximizing congruency with the reference model only on positive predictions,
unlike model distillation. We propose a simple approach for PC training, Focal
Distillation, which enforces congruence with the reference model by giving more
weight to samples that were correctly classified by the reference model. We also
found that, if the reference model is itself chosen to be an ensemble of multiple
deep neural networks, negative flips can be further reduced without affecting the
new model's accuracy.
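A minimal PyTorch-style sketch of the idea (the KL-based distillation term and the (alpha, beta) weighting are illustrative assumptions, not necessarily the paper's exact formulation), together with the negative-flip-rate metric implied by the definition above:
```python
import torch
import torch.nn.functional as F

def focal_distillation_loss(new_logits, old_logits, labels, alpha=1.0, beta=5.0, T=1.0):
    """Cross-entropy plus a distillation term that is up-weighted on samples
    the old (reference) model already classifies correctly.
    (alpha, beta, and the KL form are illustrative assumptions.)"""
    ce = F.cross_entropy(new_logits, labels, reduction="none")

    # Per-sample KL between the new and old predictive distributions.
    kl = F.kl_div(
        F.log_softmax(new_logits / T, dim=1),
        F.softmax(old_logits / T, dim=1),
        reduction="none",
    ).sum(dim=1)

    # Focal weight: every sample gets alpha; samples the reference model got right get an extra beta.
    old_correct = (old_logits.argmax(dim=1) == labels).float()
    weight = alpha + beta * old_correct

    return (ce + weight * kl).mean()

def negative_flip_rate(new_logits, old_logits, labels):
    """Fraction of samples the old model got right but the new model gets wrong."""
    old_correct = old_logits.argmax(dim=1) == labels
    new_wrong = new_logits.argmax(dim=1) != labels
    return (old_correct & new_wrong).float().mean().item()
```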
Related papers
- Analysis of Interpolating Regression Models and the Double Descent
Phenomenon [3.883460584034765]
It is commonly assumed that models which interpolate noisy training data generalize poorly.
The best models obtained are overparametrized and the testing error exhibits the double descent behavior as the model order increases.
We derive a result based on the behavior of the smallest singular value of the regression matrix that explains the peak location and the double descent shape of the testing error as a function of model order.
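A toy NumPy sketch (an illustrative setup, not the paper's experiment) that reproduces the double-descent shape with minimum-norm least squares on random features and reports the smallest singular value of the regression matrix, whose dip near the interpolation threshold drives the peak in test error:
```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 20

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)  # noisy training labels
y_te = X_te @ w_true

W = rng.normal(size=(d, 200)) / np.sqrt(d)              # fixed random feature map

for p in [5, 10, 20, 30, 40, 50, 80, 150, 200]:         # model order = number of features
    Phi_tr, Phi_te = np.tanh(X_tr @ W[:, :p]), np.tanh(X_te @ W[:, :p])
    beta = np.linalg.pinv(Phi_tr) @ y_tr                # min-norm (interpolating when p >= n) solution
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    s_min = np.linalg.svd(Phi_tr, compute_uv=False).min()
    print(f"p={p:4d}  test MSE={test_mse:10.3f}  s_min={s_min:.3f}")
```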
arXiv Detail & Related papers (2023-04-17T09:44:33Z)
- Backward Compatibility During Data Updates by Weight Interpolation [17.502410289568587]
We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI).
BCWI reduces negative flips without sacrificing the improved accuracy of the new model.
We also explore the use of importance weighting during interpolation, and averaging the weights of multiple new models, in order to further reduce negative flips.
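A minimal sketch of weight interpolation between the old and new checkpoints (the coefficient lam, the uniform weight averaging, and the state_dict-level treatment are illustrative assumptions; the old and new models are assumed to share the same architecture):
```python
import torch

def interpolate_weights(old_state, new_state, lam=0.5):
    """Per-parameter interpolation lam * old + (1 - lam) * new."""
    out = {}
    for k, v_new in new_state.items():
        v_old = old_state[k]
        if v_new.is_floating_point():
            out[k] = lam * v_old + (1.0 - lam) * v_new
        else:
            out[k] = v_new.clone()  # keep integer buffers (e.g., BatchNorm counters) from the new model
    return out

def average_weights(states):
    """Uniform average of several new models' weights; non-float buffers copied from the first."""
    avg = {}
    for k, v in states[0].items():
        if v.is_floating_point():
            avg[k] = torch.stack([s[k] for s in states]).mean(dim=0)
        else:
            avg[k] = v.clone()
    return avg

# Usage: load the interpolated weights back into the new model's architecture, e.g.
# new_model.load_state_dict(interpolate_weights(old_model.state_dict(),
#                                               new_model.state_dict(), lam=0.3))
```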
arXiv Detail & Related papers (2023-01-25T12:23:10Z)
- ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training [110.52785254565518]
Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy by forcing a new model to imitate the old models, or use ensembles.
We analyze the role of ensembles in reducing NFR and observe that they remove negative flips that are typically not close to the decision boundary.
We present a method, called Ensemble Logit Difference Inhibition (ELODI), to train a classification system that achieves paragon performance in both error rate and NFR.
arXiv Detail & Related papers (2022-05-12T17:59:56Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
In practice, unlabeled data is commonly imbalanced and follows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
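A minimal sketch of a predictive-churn metric and a distillation objective with the base model as teacher (the KL form and the lam, T hyperparameters are illustrative assumptions, not the paper's exact constraint):
```python
import torch
import torch.nn.functional as F

def predictive_churn(base_logits, new_logits):
    """Fraction of samples on which the two models' predicted labels disagree."""
    return (base_logits.argmax(dim=1) != new_logits.argmax(dim=1)).float().mean().item()

def churn_reduction_loss(new_logits, base_logits, labels, lam=0.5, T=2.0):
    """Cross-entropy plus a distillation term using the base model as teacher,
    standing in for an explicit constraint on predictive churn."""
    ce = F.cross_entropy(new_logits, labels)
    kd = F.kl_div(
        F.log_softmax(new_logits / T, dim=1),
        F.softmax(base_logits / T, dim=1),
        reduction="batchmean",
    )
    return ce + lam * kd
```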
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
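A minimal sketch of the InfoNCE loss with an explicit number of negatives K per query (illustrative; the paper's training effectiveness function for selecting K is not reproduced here):
```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.1):
    """query: (B, D); positive: (B, D); negatives: (B, K, D).
    K controls how many negative samples enter the denominator per query."""
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)

    l_pos = (q * pos).sum(dim=-1, keepdim=True)              # (B, 1) positive similarity
    l_neg = torch.einsum("bd,bkd->bk", q, neg)               # (B, K) negative similarities

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature  # (B, 1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```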
arXiv Detail & Related papers (2021-05-27T08:38:29Z)
- Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates [68.09049111171862]
This work focuses on quantifying, reducing and analyzing regression errors in NLP model updates.
We formulate regression-free model updates as a constrained optimization problem.
We empirically analyze how model ensemble reduces regression.
arXiv Detail & Related papers (2021-05-07T03:33:00Z)