Decoupled Kullback-Leibler Divergence Loss
- URL: http://arxiv.org/abs/2305.13948v1
- Date: Tue, 23 May 2023 11:17:45 GMT
- Title: Decoupled Kullback-Leibler Divergence Loss
- Authors: Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojuan Qi, Bei Yu,
Hanwang Zhang
- Abstract summary: The Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce global information into DKL for intra-class consistency regularization.
The proposed approach achieves new state-of-the-art performance on both tasks, demonstrating its substantial practical merits.
- Score: 75.31157286595517
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss
and observe that it is equivalent to the Decoupled Kullback-Leibler (DKL)
Divergence loss that consists of 1) a weighted Mean Square Error (wMSE) loss
and 2) a Cross-Entropy loss incorporating soft labels. From our analysis of the
DKL loss, we have identified two areas for improvement. Firstly, we address the
limitation of DKL in scenarios like knowledge distillation by breaking its
asymmetry property in training optimization. This modification ensures that the
wMSE component is always effective during training, providing extra
constructive cues. Secondly, we introduce global information into DKL for
intra-class consistency regularization. With these two enhancements, we derive
the Improved Kullback-Leibler (IKL) Divergence loss and evaluate its
effectiveness by conducting experiments on CIFAR-10/100 and ImageNet datasets,
focusing on adversarial training and knowledge distillation tasks. The proposed
approach achieves new state-of-the-art performance on both tasks, demonstrating
its substantial practical merits. Code and models will be available soon at
https://github.com/jiequancui/DKL.
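For reference, the sketch below (PyTorch; the function name, temperature value, and scaling convention are our own choices, not taken from the authors' repository) implements the standard temperature-scaled KL divergence loss that the paper analyzes. The comments note the algebraic link to a cross-entropy with soft labels; the weighted MSE (wMSE) component of DKL arises from the paper's gradient-level analysis and is not reproduced here.

```python
# Minimal sketch (PyTorch), not the authors' released implementation:
# the temperature-scaled KL divergence loss that the paper decomposes.
import torch
import torch.nn.functional as F

def kl_kd_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               tau: float = 4.0) -> torch.Tensor:
    """KL(teacher || student) on temperature-softened distributions."""
    p_t = F.softmax(teacher_logits / tau, dim=-1)           # soft labels
    log_q_s = F.log_softmax(student_logits / tau, dim=-1)   # student log-probs
    # KL(p_t || q_s) = sum_j p_t[j] * (log p_t[j] - log q_s[j])
    #               = CE(p_t, q_s) - H(p_t),
    # i.e. a cross-entropy with soft labels minus the (student-independent)
    # teacher entropy. The additional weighted-MSE (wMSE) component of DKL
    # comes from the paper's gradient-level analysis and is not shown here.
    kl = F.kl_div(log_q_s, p_t, reduction="batchmean")
    return (tau ** 2) * kl   # usual tau^2 scaling used in distillation

# usage
student_logits = torch.randn(8, 100)
teacher_logits = torch.randn(8, 100)
loss = kl_kd_loss(student_logits, teacher_logits)
```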
Related papers
- Generalized Kullback-Leibler Divergence Loss [105.66549870868971]
We prove that the Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
Thanks to the decoupled structure of DKL loss, we have identified two areas for improvement.
arXiv Detail & Related papers (2025-03-11T04:43:33Z) - Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving efficiency of RL fine-tuning for large language models.
Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored.
We propose an optimistic-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret.
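For background, a commonly used form of the KL-regularized objective in RL fine-tuning of language models is sketched below; the notation is ours, and the paper's contextual-bandit formulation may differ in detail.

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\bigl[r(x, y)\bigr]
\;-\;
\beta\, \mathbb{E}_{x \sim \mathcal{D}}\Bigl[\mathrm{KL}\bigl(\pi(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)\Bigr]
```

Here $r$ is the reward, $\pi_{\mathrm{ref}}$ the reference policy, and $\beta$ controls how strongly the learned policy is kept close to the reference.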
arXiv Detail & Related papers (2025-02-11T11:11:05Z) - On the Power of Perturbation under Sampling in Solving Extensive-Form Games [56.013335390600524]
We investigate how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in solving extensive-form games under sampling. We present a unified framework for Perturbed FTRL algorithms and study two variants: PFTRL-KL and PFTRL-RKL.
arXiv Detail & Related papers (2025-01-28T00:29:38Z) - Kendall's $τ$ Coefficient for Logits Distillation [33.77389987117822]
We propose a ranking loss based on Kendall's $\tau$ coefficient, called Rank-Kendall Knowledge Distillation (RKKD).
RKKD balances the attention to smaller-valued channels by constraining the order of channel values in student logits.
Our experiments show that our RKKD can enhance the performance of various knowledge distillation baselines and offer broad improvements across multiple teacher-student architecture combinations.
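For intuition, the snippet below (our illustration, not the RKKD implementation) computes plain Kendall's $\tau$ between the teacher and student logits of a single sample, i.e. the pairwise channel-order agreement that RKKD turns into a differentiable ranking loss.

```python
# Illustrative only: Kendall's tau (tau-a) over all channel pairs of one
# logit vector, measuring how well the student preserves the teacher's
# channel ordering. The actual RKKD loss is a differentiable loss built
# around this coefficient; this sketch is not that loss.
import torch

def kendall_tau(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    n = student_logits.numel()
    i, j = torch.triu_indices(n, n, offset=1)        # all channel pairs i < j
    s_sign = torch.sign(student_logits[i] - student_logits[j])
    t_sign = torch.sign(teacher_logits[i] - teacher_logits[j])
    concordant_minus_discordant = (s_sign * t_sign).sum()
    return concordant_minus_discordant / (n * (n - 1) / 2)

student_logits = torch.randn(100)
teacher_logits = torch.randn(100)
print(kendall_tau(student_logits, teacher_logits))   # in [-1, 1]; 1 = same ordering
```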
arXiv Detail & Related papers (2024-09-26T13:21:02Z) - A Unified Contrastive Loss for Self-Training [3.3454373538792552]
Self-training methods have proven to be effective in exploiting abundant unlabeled data in semi-supervised learning.
We propose a general framework to enhance self-training methods, which replaces all instances of cross-entropy (CE) losses with a single contrastive loss.
Our framework results in significant performance improvements across three different datasets with a limited number of labeled data.
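As an illustration of the kind of loss such a framework substitutes for cross-entropy, here is a generic label-aware (SupCon/InfoNCE-style) contrastive loss; the paper's unified loss may differ in its exact form, and all names below are ours.

```python
# Generic label-aware contrastive loss (SupCon/InfoNCE-style) of the kind
# such frameworks build on; illustrative only, names and details are ours.
import torch
import torch.nn.functional as F

def contrastive_loss(features: torch.Tensor, labels: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """features: (N, D) embeddings; labels: (N,) labels or pseudo-labels."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature                       # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)        # avoid division by zero
    # mean log-probability of same-label (positive) pairs, negated
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count)
    return loss.mean()

feats = torch.randn(16, 128)
pseudo_labels = torch.randint(0, 4, (16,))
print(contrastive_loss(feats, pseudo_labels))
```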
arXiv Detail & Related papers (2024-09-11T14:22:41Z) - EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification [1.3778851745408134]
We propose a novel ensemble method, namely EnsLoss, to combine loss functions within the empirical risk minimization (ERM) framework.
We first transform the CC conditions of losses into loss-derivatives, thereby bypassing the need for explicit loss functions.
We theoretically establish the statistical consistency of our approach and provide insights into its benefits.
arXiv Detail & Related papers (2024-09-02T02:40:42Z) - Sinkhorn Distance Minimization for Knowledge Distillation [97.64216712016571]
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
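For reference, a minimal entropic-regularized Sinkhorn distance between a teacher and a student class distribution can be computed as below; the ground cost, regularization strength, and iteration count are illustrative choices rather than the SinKD settings.

```python
# Sketch of the Sinkhorn distance between two class distributions (one sample).
# Hyperparameters and the ground cost are illustrative, not the SinKD settings.
import torch
import torch.nn.functional as F

def sinkhorn_distance(p: torch.Tensor, q: torch.Tensor, cost: torch.Tensor,
                      epsilon: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """p, q: probability vectors of shape (K,); cost: (K, K) ground cost."""
    kernel = torch.exp(-cost / epsilon)               # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):                          # Sinkhorn fixed-point updates
        v = q / (kernel.t() @ u)
        u = p / (kernel @ v)
    transport = u.unsqueeze(1) * kernel * v.unsqueeze(0)   # entropic transport plan
    return (transport * cost).sum()

num_classes = 10
teacher_p = F.softmax(torch.randn(num_classes), dim=0)
student_q = F.softmax(torch.randn(num_classes), dim=0)
cost = 1.0 - torch.eye(num_classes)                   # 0 on-diagonal, 1 elsewhere
print(sinkhorn_distance(teacher_p, student_q, cost))
```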
arXiv Detail & Related papers (2024-02-27T01:13:58Z) - Mitigating Privacy Risk in Membership Inference by Convex-Concave Loss [16.399746814823025]
Machine learning models are susceptible to membership inference attacks (MIAs), which aim to infer whether a sample is in the training set.
Existing work utilizes gradient ascent to enlarge the loss variance of training data, alleviating the privacy risk.
We propose a novel method, Convex-Concave Loss, which enables a high variance of the training loss distribution via gradient descent.
arXiv Detail & Related papers (2024-02-08T07:14:17Z) - FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated
Learning [66.56240101249803]
We study how hardening benign clients can affect the global model (and the malicious clients).
We propose a trigger reverse engineering based defense and show that our method achieves improvement with guaranteed robustness.
Our results on eight competing SOTA defense methods show the empirical superiority of our method on both single-shot and continuous FL backdoor attacks.
arXiv Detail & Related papers (2022-10-23T22:24:03Z) - The KFIoU Loss for Rotated Object Detection [115.334070064346]
In this paper, we argue that one effective alternative is to devise an approximate loss that can achieve trend-level alignment with the SkewIoU loss.
Specifically, we model the objects as Gaussian distributions and adopt a Kalman filter to inherently mimic the mechanism of SkewIoU.
The resulting new loss, called KFIoU, is easier to implement and works better than the exact SkewIoU.
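As a sketch of the first step shared by such Gaussian-based rotated-box losses, the code below converts a rotated box into a 2-D Gaussian (mean at the box center, covariance from the box size and angle); the Kalman-filter overlap computation specific to KFIoU is not reproduced, and the function name is ours.

```python
# Illustrative conversion of a rotated box (cx, cy, w, h, theta) into a
# 2-D Gaussian N(mu, Sigma); only this modelling step is sketched here.
import torch

def rbox_to_gaussian(box: torch.Tensor):
    """box: (..., 5) with (cx, cy, w, h, theta in radians) -> (mu, Sigma)."""
    cx, cy, w, h, theta = box.unbind(-1)
    mu = torch.stack([cx, cy], dim=-1)
    cos, sin = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin, cos], dim=-1)], dim=-2)   # rotation matrix
    diag = torch.diag_embed(torch.stack([w ** 2 / 4.0, h ** 2 / 4.0], dim=-1))
    sigma = rot @ diag @ rot.transpose(-1, -2)
    return mu, sigma

mu, sigma = rbox_to_gaussian(torch.tensor([10.0, 20.0, 8.0, 4.0, 0.3]))
```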
arXiv Detail & Related papers (2022-01-29T10:54:57Z) - Label Distributionally Robust Losses for Multi-class Classification:
Consistency, Robustness and Adaptivity [55.29408396918968]
We study a family of loss functions named label-distributionally robust (LDR) losses for multi-class classification.
Our contributions cover both consistency and robustness, including establishing the top-$k$ consistency of LDR losses for multi-class classification.
We propose a new adaptive LDR loss that automatically adapts the individualized temperature parameter to the noise degree of class label of each instance.
arXiv Detail & Related papers (2021-12-30T00:27:30Z) - Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in
Knowledge Distillation [9.157410884444312]
Knowledge distillation (KD) has been investigated to design efficient neural architectures.
We show that the KL divergence loss focuses on logit matching as the temperature $\tau$ increases and on label matching as $\tau$ goes to 0.
We show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with a small $\tau$, mitigates label noise.
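The temperature claim can be checked numerically: with the usual $\tau^2$-scaled KL loss, the gradient with respect to the student logits approaches the mean-centered logit difference (scaled by 1/K for K classes) as $\tau$ grows, which is the classic logit-matching limit. The snippet below is our own check, not code from the paper.

```python
# Numerical check (ours): the gradient of the tau^2-scaled KL loss w.r.t. the
# student logits approaches the mean-centered logit difference / K as tau grows.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
teacher = torch.randn(10)
student = torch.randn(10, requires_grad=True)

for tau in (1.0, 4.0, 64.0):
    p_t = F.softmax(teacher / tau, dim=0)
    log_q = F.log_softmax(student / tau, dim=0)
    loss = (tau ** 2) * F.kl_div(log_q, p_t, reduction="sum")
    (grad,) = torch.autograd.grad(loss, student)
    z_s = student.detach()
    centered_diff = ((z_s - z_s.mean()) - (teacher - teacher.mean())) / teacher.numel()
    print(tau, torch.norm(grad - centered_diff).item())   # gap shrinks as tau grows
```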
arXiv Detail & Related papers (2021-05-19T04:40:53Z) - Semi-supervised Contrastive Learning with Similarity Co-calibration [72.38187308270135]
We propose a novel training strategy, termed Semi-supervised Contrastive Learning (SsCL).
SsCL combines the well-known contrastive loss in self-supervised learning with the cross entropy loss in semi-supervised learning.
We show that SsCL produces more discriminative representations and is beneficial to few-shot learning.
arXiv Detail & Related papers (2021-05-16T09:13:56Z)