Bayesian Self-Distillation for Image Classification
- URL: http://arxiv.org/abs/2512.24162v1
- Date: Tue, 30 Dec 2025 11:48:06 GMT
- Title: Bayesian Self-Distillation for Image Classification
- Authors: Anton Adelöw, Matteo Gamba, Atsuto Maki
- Abstract summary: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. We propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. BSD consistently yields higher test accuracy and significantly lower Expected Calibration Error (ECE) (-40%) than existing architecture-preserving self-distillation methods.
- Score: 6.446179861303341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.
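The headline calibration gain is reported as Expected Calibration Error (ECE). The paper's own evaluation code is not shown here; the following is a minimal sketch of the standard binned ECE definition (equal-width confidence bins; the bin count of 15 is a common convention, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    confidences = probs.max(axis=1)          # predicted-class probability
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # bin weight * |bin accuracy - bin confidence|
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```

A model that is always 90% confident but always correct, for example, scores an ECE of 0.1; a perfectly calibrated model scores 0.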
Related papers
- CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation [71.52209438343928]
Core Distribution Alignment (CoDA) is a framework that enables effective Dataset Distillation (DD) using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics.
arXiv Detail & Related papers (2025-12-03T14:45:57Z)
- BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition [78.70453964041718]
Current deep learning algorithms usually solve for the optimal classifier by implicitly estimating the posterior probabilities. This simple methodology has been proven effective for meticulously balanced academic benchmark datasets. However, it is not applicable to the long-tailed data distributions in the real world. This paper presents a novel approach (BAPE) that provides a more precise theoretical estimation of the data distributions.
arXiv Detail & Related papers (2025-06-29T15:12:50Z)
- Credal Wrapper of Model Averaging for Uncertainty Estimation in Classification [5.19656787424626]
This paper presents an innovative approach, called credal wrapper, to formulating a credal set representation of model averaging for Bayesian neural networks (BNNs) and deep ensembles (DEs). Given a finite collection of single predictive distributions derived from BNNs or DEs, the proposed credal wrapper approach extracts an upper and a lower probability bound per class. Compared to the BNN and DE baselines, the proposed credal wrapper method exhibits superior performance in uncertainty estimation and achieves a lower expected calibration error on corrupted data.
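One plausible reading of the interval-extraction step is to take, for each class, the envelope of the ensemble members' predictive distributions. This sketch is an illustration of that idea, not the paper's implementation:

```python
import numpy as np

def credal_bounds(member_probs):
    """Given an (n_members, n_classes) stack of predictive distributions
    from BNN samples or ensemble members, return per-class lower and
    upper probability bounds (the envelope of the collection)."""
    member_probs = np.asarray(member_probs)
    return member_probs.min(axis=0), member_probs.max(axis=0)
```

A wide gap between the bounds for a class then signals disagreement among members, i.e. high epistemic uncertainty.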
arXiv Detail & Related papers (2024-05-23T20:51:22Z)
- Certified PEFTSmoothing: Parameter-Efficient Fine-Tuning with Randomized Smoothing [6.86204821852287]
Randomized smoothing is the primary certified robustness method for assessing the robustness of deep learning models to adversarial perturbations in the l2-norm.
A notable constraint limiting widespread adoption is the necessity to retrain base models entirely from scratch to attain a robust version.
This is because the base model fails to learn the noise-augmented data distribution to give an accurate vote.
Inspired by recent large model training procedures, we explore an alternative way named PEFTSmoothing to adapt the base model to learn the noise-augmented data.
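For context, the smoothed classifier that the base model must support is conventionally estimated by majority vote over Gaussian-perturbed copies of the input (the standard randomized-smoothing construction; the function name and parameters below are illustrative, not from the paper):

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=100, rng=None):
    """Monte Carlo estimate of the smoothed classifier
    g(x) = argmax_c P(f(x + noise) = c), with noise ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(rng)
    votes = {}
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        label = base_classifier(noisy)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)  # majority-vote class
```

This also makes the paper's constraint concrete: `base_classifier` only votes accurately on the noisy inputs if it has been adapted to the noise-augmented distribution, which PEFTSmoothing aims to do without full retraining.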
arXiv Detail & Related papers (2024-04-08T09:38:22Z)
- Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing [9.637143119088426]
We show that a robust base classifier's confidence difference for correct and incorrect examples is the key to this improvement.
We adapt an adversarial input detector into a mixing network that adaptively adjusts the mixture of the two base models.
The proposed flexible method, termed "adaptive smoothing", can work in conjunction with existing or even future methods that improve clean accuracy, robustness, or adversary detection.
arXiv Detail & Related papers (2023-01-29T22:05:28Z)
- Source-Free Progressive Graph Learning for Open-Set Domain Adaptation [44.63301903324783]
Open-set domain adaptation (OSDA) has gained considerable attention in many visual recognition tasks.
We propose a Progressive Graph Learning (PGL) framework that decomposes the target hypothesis space into the shared and unknown subspaces.
We also tackle a more realistic source-free open-set domain adaptation (SF-OSDA) setting that makes no assumption about the coexistence of source and target domains.
arXiv Detail & Related papers (2022-02-13T01:19:41Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
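The ATC recipe can be sketched in a few lines: pick the threshold on labeled source data so that the fraction of confidences above it matches source accuracy, then report that fraction on the unlabeled target data. This is a minimal illustration of the idea, not the authors' reference code:

```python
import numpy as np

def learn_atc_threshold(source_confidences, source_correct):
    """Choose threshold t so that the share of source confidences above t
    equals the source accuracy (ATC's calibration step on labeled data)."""
    acc = np.mean(source_correct)
    return np.quantile(source_confidences, 1.0 - acc)

def predict_target_accuracy(target_confidences, threshold):
    """Predicted target accuracy = fraction of unlabeled target
    examples whose confidence exceeds the learned threshold."""
    return np.mean(target_confidences > threshold)
```

By construction the method reproduces the source accuracy on source data; its value is that the same threshold transfers a usable accuracy estimate to shifted target data.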
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
- Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection [85.53263670166304]
One-stage detector basically formulates object detection as dense classification and localization.
Recent trend for one-stage detectors is to introduce an individual prediction branch to estimate the quality of localization.
This paper delves into the representations of the above three fundamental elements: quality estimation, classification and localization.
arXiv Detail & Related papers (2020-06-08T07:24:33Z)
- Uncertainty Estimation Using a Single Deep Deterministic Neural Network [66.26231423824089]
We propose a method for training a deterministic deep model that can find and reject out of distribution data points at test time with a single forward pass.
We scale training of these models with a novel loss function and centroid updating scheme, and match the accuracy of softmax models.
arXiv Detail & Related papers (2020-03-04T12:27:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.