Related papers: Combating the Instability of Mutual Information-based Losses via Regularization

URL: http://arxiv.org/abs/2011.07932v4
Date: Sat, 18 Jun 2022 04:01:51 GMT
Title: Combating the Instability of Mutual Information-based Losses via Regularization
Authors: Kwanghee Choi and Siyeong Lee
Abstract summary: We first identify the symptoms behind the instability of MI-based losses. We mitigate both issues by adding a novel regularization term to the existing losses. We present a novel benchmark that evaluates MI-based losses on both their MI estimation power and their capability on downstream tasks.
Score: 7.424262881242935
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Notable progress has been made in numerous fields of machine learning based on neural network-driven mutual information (MI) bounds. However, utilizing the conventional MI-based losses is often challenging due to their practical and mathematical limitations. In this work, we first identify the symptoms behind their instability: (1) the neural network not converging even after the loss seemed to converge, and (2) saturating neural network outputs causing the loss to diverge. We mitigate both issues by adding a novel regularization term to the existing losses. We theoretically and experimentally demonstrate that added regularization stabilizes training. Finally, we present a novel benchmark that evaluates MI-based losses on both the MI estimation power and its capability on the downstream tasks, closely following the pre-existing supervised and contrastive learning settings. We evaluate six different MI-based losses and their regularized counterparts on multiple benchmarks to show that our approach is simple yet effective.
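The abstract does not spell out the regularization term, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses the Donsker-Varadhan bound E_P[T] - log E_Q[e^T] as a representative MI-based loss and adds a hypothetical squared penalty that pulls log E_Q[e^T] toward a constant `target`. The names `dv_bound`, `regularized_loss`, `lam`, and `target` are illustrative and are not taken from the paper. The sketch shows one well-known property that makes such losses hard to pin down: the bound is unchanged when a constant is added to every critic output, whereas the penalized loss is not.

```python
import numpy as np


def log_mean_exp(x):
    """Numerically stable log(mean(exp(x)))."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.mean(np.exp(x - m)))


def dv_bound(t_joint, t_marginal):
    """Donsker-Varadhan lower bound on MI from critic outputs.

    t_joint:    critic outputs on samples from the joint distribution
    t_marginal: critic outputs on samples from the product of marginals
    """
    return np.mean(t_joint) - log_mean_exp(t_marginal)


def regularized_loss(t_joint, t_marginal, lam=0.1, target=0.0):
    """Negative DV bound plus a squared penalty (an assumed form).

    The penalty lam * (log E_Q[e^T] - target)^2 is a hypothetical
    regularizer chosen for illustration, not the paper's stated term.
    """
    lme = log_mean_exp(t_marginal)
    return -(np.mean(t_joint) - lme) + lam * (lme - target) ** 2


# Synthetic critic outputs (stand-ins for a neural network's outputs).
rng = np.random.default_rng(0)
t_joint = rng.normal(1.0, 0.5, size=1000)
t_marginal = rng.normal(0.0, 0.5, size=1000)

# Adding a constant to every output leaves the DV bound unchanged...
shift = 50.0
bound_plain = dv_bound(t_joint, t_marginal)
bound_shifted = dv_bound(t_joint + shift, t_marginal + shift)

# ...but the penalized loss grows, so the shifted outputs are discouraged.
loss_plain = regularized_loss(t_joint, t_marginal)
loss_shifted = regularized_loss(t_joint + shift, t_marginal + shift)

print(f"DV bound, plain vs shifted: {bound_plain:.6f} vs {bound_shifted:.6f}")
print(f"Regularized loss, plain vs shifted: {loss_plain:.4f} vs {loss_shifted:.4f}")
```

Because the unregularized bound cannot distinguish between outputs that differ by a constant, nothing in it stops the outputs from drifting; a penalty of this kind removes that freedom. Whether this matches the term used by the authors should be checked against the paper itself.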

Related papers

Overcoming catastrophic forgetting in neural networks [0.0]
Catastrophic forgetting is the primary challenge that hinders continual learning. Elastic Weight Consolidation is a regularization-based approach inspired by synaptic consolidation in biological neural systems. Our results confirm what was shown in previous research, showing that EWC significantly reduces forgetting compared to naive training.
arXiv Detail & Related papers (2025-07-14T17:04:05Z)
On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling [11.168336416219857]
Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates. We show that this discrepancy is not fully explained by finite-width phenomena such as catapult effects. We validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss.
arXiv Detail & Related papers (2025-05-28T15:40:48Z)
Enhancing the "Immunity" of Mixture-of-Experts Networks for Adversarial Defense [6.3712912872409415]
Recent studies have revealed the vulnerability of Deep Neural Networks (DNNs) to adversarial examples. We propose a novel adversarial defense method called "Immunity" based on a modified Mixture-of-Experts (MoE) architecture.
arXiv Detail & Related papers (2024-02-29T01:27:38Z)
Neuro-mimetic Task-free Unsupervised Online Learning with Continual Self-Organizing Maps [56.827895559823126]
Self-organizing map (SOM) is a neural model often used in clustering and dimensionality reduction. We propose a generalization of the SOM, the continual SOM, which is capable of online unsupervised learning under a low memory budget. Our results, on benchmarks including MNIST, Kuzushiji-MNIST, and Fashion-MNIST, show almost a two times increase in accuracy.
arXiv Detail & Related papers (2024-02-19T19:11:22Z)
Estimation of individual causal effects in network setup for multiple treatments [4.53340898566495]
We study the problem of estimation of Individual Treatment Effects (ITE) in the context of multiple treatments and observational data. We employ Graph Convolutional Networks (GCN) to learn a shared representation of the confounders. Our approach utilizes separate neural networks to infer potential outcomes for each treatment.
arXiv Detail & Related papers (2023-12-18T06:07:45Z)
A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning [129.63326990812234]
We propose a technique named data-dependent contraction to capture how modified losses handle different classes. On top of this technique, a fine-grained generalization bound is established for imbalanced learning, which helps reveal the mystery of re-weighting and logit-adjustment.
arXiv Detail & Related papers (2023-10-07T09:15:08Z)
SuSana Distancia is all you need: Enforcing class separability in metric learning via two novel distance-based loss functions for few-shot image classification [0.9236074230806579]
We propose two loss functions which consider the importance of the embedding vectors by looking at the intra-class and inter-class distance between the few data. Our results show a significant improvement in accuracy in the miniImageNet benchmark compared to other metric-based few-shot learning methods by a margin of 2%.
arXiv Detail & Related papers (2023-05-15T23:12:09Z)
Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training. We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
Reducing Catastrophic Forgetting in Self Organizing Maps with Internally-Induced Generative Replay [67.50637511633212]
A lifelong learning agent is able to continually learn from potentially infinite streams of pattern sensory data. One major historic difficulty in building agents that adapt is that neural systems struggle to retain previously-acquired knowledge when learning from new samples. This problem is known as catastrophic forgetting (interference) and remains an unsolved problem in the domain of machine learning to this day.
arXiv Detail & Related papers (2021-12-09T07:11:14Z)
On the Generalization Properties of Adversarial Training [21.79888306754263]
This paper studies the generalization performance of a generic adversarial training algorithm. A series of numerical studies are conducted to demonstrate how the smoothness and L1 penalization help improve the adversarial robustness of models.
arXiv Detail & Related papers (2020-08-15T02:32:09Z)
Vulnerability Under Adversarial Machine Learning: Bias or Variance? [77.30759061082085]
We investigate the effect of adversarial machine learning on the bias and variance of a trained deep neural network. Our analysis sheds light on why the deep neural networks have poor performance under adversarial perturbation. We introduce a new adversarial machine learning algorithm with lower computational complexity than well-known adversarial machine learning strategies.
arXiv Detail & Related papers (2020-08-01T00:58:54Z)
Optimization and Generalization of Regularization-Based Continual Learning: a Loss Approximation Viewpoint [35.5156045701898]
We provide a novel viewpoint of regularization-based continual learning by formulating it as a second-order Taylor approximation of the loss function of each task. Based on this viewpoint, we study the optimization aspects (i.e., convergence) as well as generalization properties (i.e., finite-sample guarantees) of regularization-based continual learning.
arXiv Detail & Related papers (2020-06-19T06:08:40Z)
Feature Purification: How Adversarial Training Performs Robust Deep Learning [66.05472746340142]
We show a principle that we call Feature Purification, where we show one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.