Related papers: Forget Me Not: Fighting Local Overfitting with Knowledge Fusion and Distillation

URL: http://arxiv.org/abs/2507.08686v1
Date: Fri, 11 Jul 2025 15:37:24 GMT
Title: Forget Me Not: Fighting Local Overfitting with Knowledge Fusion and Distillation
Authors: Uri Stern, Eli Corn, Daphna Weinshall
Abstract summary: We introduce a novel score that measures the forgetting rate of deep models on validation data. We demonstrate that local overfitting can arise even without conventional overfitting. We then introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge.
Score: 6.7864586321550595
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Overfitting in deep neural networks occurs less frequently than expected. This is a puzzling observation, as theory predicts that greater model capacity should eventually lead to overfitting -- yet this is rarely seen in practice. But what if overfitting does occur, not globally, but in specific sub-regions of the data space? In this work, we introduce a novel score that measures the forgetting rate of deep models on validation data, capturing what we term local overfitting: a performance degradation confined to certain regions of the input space. We demonstrate that local overfitting can arise even without conventional overfitting, and is closely linked to the double descent phenomenon. Building on these insights, we introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge: first, by aggregating checkpoints into an ensemble, and then by distilling it into a single model of the original size, thus enhancing performance without added inference cost. Extensive experiments across multiple datasets, modern architectures, and training regimes validate the effectiveness of our approach. Notably, in the presence of label noise, our method -- Knowledge Fusion followed by Knowledge Distillation -- outperforms both the original model and independently trained ensembles, achieving a rare win-win scenario: reduced training and inference complexity.

Related papers

A Classical View on Benign Overfitting: The Role of Sample Size [14.36840959836957]
We focus on almost benign overfitting, where models simultaneously achieve both arbitrarily small training and test errors. This behavior is characteristic of neural networks, which often achieve low (but non-zero) training error while still generalizing well.
arXiv Detail & Related papers (2025-05-16T18:37:51Z)
On Local Overfitting and Forgetting in Deep Neural Networks [6.7864586321550595]
We propose a novel score that captures the forgetting rate of deep models on validation data. We show that local overfitting occurs regardless of the presence of traditional overfitting. We devise a new ensemble method that aims to recover forgotten knowledge, relying solely on the training history of a single network.
arXiv Detail & Related papers (2024-12-17T14:53:38Z)
Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that requires no prior knowledge of the number of samples per class. Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
Visual Self-paced Iterative Learning for Unsupervised Temporal Action Localization [50.48350210022611]
We present a novel self-paced iterative learning model to enhance clustering and localization training simultaneously. We design two (constant- and variable-speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of these video pseudolabels.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor [18.696420390977863]
This paper systematically analyzes the forgetting degree of each class during local training across different communication rounds. Motivated by these findings, we propose a novel and straightforward algorithm called Federated Knowledge Anchor (FedKA).
arXiv Detail & Related papers (2023-12-05T01:12:56Z)
Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other. We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs [9.010643838773477]
We introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data. We show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated. We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement without any additional cost in training time.
arXiv Detail & Related papers (2023-10-17T09:22:22Z)
United We Stand: Using Epoch-wise Agreement of Ensembles to Combat Overfit [7.627299398469962]
We introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting. Our method allows for the incorporation of useful knowledge obtained during the overfitting phase without deterioration of the general performance. Our method is easy to implement and can be integrated with any training scheme and architecture.
arXiv Detail & Related papers (2023-10-17T08:51:44Z)
Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models. It is most prominently used in federated learning. We analyse the performance of the models that result from averaging single layers, or groups of layers.
arXiv Detail & Related papers (2023-07-13T09:39:10Z)
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks. We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework. TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
Certified Robustness in Federated Learning [54.03574895808258]
We study the interplay between federated training, personalization, and certified robustness. We find that the simple federated averaging technique is effective in building not only more accurate, but also more certifiably-robust models.
arXiv Detail & Related papers (2022-06-06T12:10:53Z)
Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.