From Channel Bias to Feature Redundancy: Uncovering the "Less is More" Principle in Few-Shot Learning
- URL: http://arxiv.org/abs/2310.03843v2
- Date: Wed, 10 Sep 2025 10:53:27 GMT
- Title: From Channel Bias to Feature Redundancy: Uncovering the "Less is More" Principle in Few-Shot Learning
- Authors: Ji Zhang, Xu Luo, Lianli Gao, Difan Zou, Hengtao Shen, Jingkuan Song
- Abstract summary: Deep neural networks often fail to adapt representations to novel tasks under distribution shifts. This paper identifies a core obstacle behind this failure: channel bias. We show that for few-shot tasks, classification accuracy is significantly improved by using as few as 1-5% of the most discriminative feature dimensions.
- Score: 138.06600932634896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks often fail to adapt representations to novel tasks under distribution shifts, especially when only a few examples are available. This paper identifies a core obstacle behind this failure: channel bias, where networks develop a rigid emphasis on feature dimensions that were discriminative for the source task, but this emphasis is misaligned and fails to adapt to the distinct needs of a novel task. This bias leads to a striking and detrimental consequence: feature redundancy. We demonstrate that for few-shot tasks, classification accuracy is significantly improved by using as few as 1-5% of the most discriminative feature dimensions, revealing that the vast majority are actively harmful. Our theoretical analysis confirms that this redundancy originates from confounding feature dimensions, those with high intra-class variance but low inter-class separability, which are especially problematic in low-data regimes. This "less is more" phenomenon is a defining characteristic of the few-shot setting, diminishing as more samples become available. To address this, we propose a simple yet effective soft-masking method, Augmented Feature Importance Adjustment (AFIA), which estimates feature importance from augmented data to mitigate the issue. By establishing the cohesive link from channel bias to its consequence of extreme feature redundancy, this work provides a foundational principle for few-shot representation transfer and a practical method for developing more robust few-shot learning algorithms.
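To make the abstract's mechanism concrete, here is a minimal sketch (not the authors' released code) of both ideas: a Fisher-style per-dimension score, estimated from features of augmented support samples, used either as a soft mask in the spirit of AFIA or as a hard top-1-5% selection. All function names and the augmentation stand-in are illustrative assumptions.

```python
import numpy as np

def afia_style_soft_mask(support_feats, support_labels, augment_fn, n_aug=10):
    """Sketch of an AFIA-style soft mask (hypothetical, not the paper's code).

    support_feats : (N, D) pre-trained features of the few-shot support set.
    support_labels: (N,) integer class labels.
    augment_fn    : stand-in for "extract features of augmented support images".
    """
    # Pool several augmented views to stabilize the few-shot estimate.
    feats = np.concatenate([augment_fn(support_feats) for _ in range(n_aug)])
    labels = np.tile(support_labels, n_aug)

    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])

    # Fisher-style score per dimension: inter-class separability over
    # intra-class variance; "confounding" dimensions score low.
    inter = means.var(axis=0)
    intra = np.stack([feats[labels == c].var(axis=0) for c in classes]).mean(axis=0)
    score = inter / (intra + 1e-8)
    return score / (score.max() + 1e-8)  # soft mask in [0, 1]

def top_fraction_mask(soft_mask, frac=0.05):
    """Hard variant of the "less is more" finding: keep only the top ~1-5%
    most discriminative dimensions and zero out the rest."""
    k = max(1, int(frac * soft_mask.size))
    hard = np.zeros_like(soft_mask)
    hard[np.argsort(soft_mask)[-k:]] = 1.0
    return hard

# Usage (noise augmentation as a toy stand-in for real image augmentation):
# mask = afia_style_soft_mask(sup_f, sup_y, lambda f: f + 0.01 * np.random.randn(*f.shape))
# query_feats_masked = query_feats * mask
```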
Related papers
- These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining [10.749875317643031]
Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data. We evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training. We identify a fundamental limitation in deep learning models, where networks fail to learn new features once they encode similar competing features during training.
arXiv Detail & Related papers (2025-06-23T01:04:29Z)
- Elastic Representation: Mitigating Spurious Correlations for Group Robustness [24.087096334524077]
Deep learning models can suffer from severe performance degradation when relying on spurious correlations between input features and labels. We propose Elastic Representation (ElRep) to learn features by imposing Nuclear- and Frobenius-norm penalties on the representation from the last layer of a neural network.
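A minimal sketch of the penalty this summary describes, added to a standard classification loss; the weights and function name are illustrative assumptions, not the authors' code.

```python
import torch

def elrep_penalty(z: torch.Tensor, lam_nuc: float = 0.1, lam_fro: float = 0.01):
    """Nuclear- plus Frobenius-norm penalty on a (batch, dim) matrix of
    last-layer representations; the two weights are made-up choices."""
    nuc = torch.linalg.matrix_norm(z, ord="nuc")  # sum of singular values
    fro = torch.linalg.matrix_norm(z, ord="fro")  # sqrt of summed squared entries
    return lam_nuc * nuc + lam_fro * fro

# total_loss = F.cross_entropy(logits, labels) + elrep_penalty(last_layer_feats)
```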
arXiv Detail & Related papers (2025-02-14T01:25:27Z)
- Robust Representation Consistency Model via Contrastive Denoising [83.47584074390842]
Randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples. We reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space.
arXiv Detail & Related papers (2025-01-22T18:52:06Z)
- On the Connection between Pre-training Data Diversity and Fine-tuning Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z)
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-trained models have achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks [64.12052498909105]
We study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks.
In two-layer ReLU networks, gradient flow is biased towards solutions that generalize well but are highly vulnerable to adversarial examples.
arXiv Detail & Related papers (2023-03-02T18:14:35Z)
- Optimal transfer protocol by incremental layer defrosting [66.76153955485584]
Transfer learning is a powerful tool enabling model training with limited amounts of data.
The simplest transfer learning protocol is based on "freezing" the feature-extractor layers of a network pre-trained on a data-rich source task.
We show that this protocol is often sub-optimal and the largest performance gain may be achieved when smaller portions of the pre-trained network are kept frozen.
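As an illustration of the protocol being compared, here is a PyTorch sketch of keeping only a smaller early portion of a pre-trained network frozen; which blocks stay frozen is the tunable knob, and this particular split is an assumption for illustration.

```python
import torch
from torchvision import models

# Load a pre-trained backbone and freeze only its earliest blocks,
# leaving the rest trainable ("defrosted").
net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

frozen = [net.conv1, net.bn1, net.layer1]  # illustrative choice of frozen portion
for module in frozen:
    for p in module.parameters():
        p.requires_grad = False

# Optimize only the trainable (defrosted) parameters.
optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
```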
arXiv Detail & Related papers (2023-03-02T17:32:11Z)
- Overcoming Simplicity Bias in Deep Networks using a Feature Sieve [5.33024001730262]
We propose a direct, interventional method for addressing simplicity bias in deep networks.
We aim to automatically identify and suppress easily-computable spurious features in lower layers of the network.
We report substantial gains on many real-world debiasing benchmarks.
arXiv Detail & Related papers (2023-01-30T21:11:13Z)
- On Measuring the Intrinsic Few-Shot Hardness of Datasets [49.37562545777455]
We show that, for a given pre-trained model, few-shot hardness may be intrinsic to datasets.
We propose a simple and lightweight metric called "Spread" that captures the intuition behind what makes few-shot learning possible.
Our metric better accounts for few-shot hardness compared to existing notions of hardness, and is 8-100x faster to compute.
arXiv Detail & Related papers (2022-11-16T18:53:52Z)
- Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in Neural Networks [66.76034024335833]
We find that diverse and complex features are indeed learned by the backbone, and that the network's brittleness is due to the linear classification head relying primarily on the simplest features.
We propose Feature Reconstruction Regularizer (FRR) to ensure that the learned features can be reconstructed back from the logits.
We demonstrate up to 15% gains in OOD accuracy on the recently introduced semi-synthetic datasets with extreme distribution shifts.
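A sketch of what an FRR-style objective could look like, assuming a learned linear map from logits back to features whose reconstruction error is penalized alongside the usual classification loss; the class name and loss weight are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRRStyleHead(nn.Module):
    """Hypothetical head: features -> logits, plus a decoder from logits
    back to features whose reconstruction error regularizes training."""
    def __init__(self, dim: int, n_classes: int, weight: float = 0.1):
        super().__init__()
        self.cls = nn.Linear(dim, n_classes)
        self.rec = nn.Linear(n_classes, dim)  # logits -> features map
        self.weight = weight                  # made-up regularization weight

    def forward(self, feats: torch.Tensor, labels: torch.Tensor):
        logits = self.cls(feats)
        recon = self.rec(logits)
        # Gradient flows into the backbone too, encouraging features
        # that remain recoverable from the logits.
        loss = F.cross_entropy(logits, labels) + self.weight * F.mse_loss(recon, feats)
        return logits, loss
```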
arXiv Detail & Related papers (2022-10-04T04:01:15Z)
- Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degenerate the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z)
- Revisiting the Updates of a Pre-trained Model for Few-shot Learning [11.871523410051527]
We compare the two popular updating methods, fine-tuning and linear probing.
We find that fine-tuning is better than linear probing as the number of samples increases.
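The two updating methods differ only in which parameters remain trainable; a minimal PyTorch sketch with a hypothetical helper:

```python
import torch
from torchvision import models

def make_model(mode: str, n_classes: int = 5):
    """Hypothetical helper contrasting the two updating methods."""
    net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    net.fc = torch.nn.Linear(net.fc.in_features, n_classes)  # new task head
    if mode == "linear_probe":
        # Freeze everything except the new head.
        for name, p in net.named_parameters():
            p.requires_grad = name.startswith("fc")
    # mode == "fine_tune": all parameters stay trainable.
    return net
```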
arXiv Detail & Related papers (2022-05-13T08:47:06Z)
- GDC- Generalized Distribution Calibration for Few-Shot Learning [5.076419064097734]
Few-shot learning is an important problem in machine learning, as large labelled datasets take considerable time and effort to assemble.
Most few-shot learning algorithms suffer from limitations such as requiring the design of sophisticated models and loss functions, which hampers interpretability.
We propose a Generalized sampling method that learns to estimate few-shot distributions for classification as weighted random variables of all large classes.
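A hedged sketch of the general idea of calibrating a few-shot class distribution from base-class statistics and then sampling from it; the similarity weighting, mixing coefficient, and ridge term are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

def calibrate_and_sample(few_shot_feats, base_means, base_covs, k=2, n_samples=100):
    """few_shot_feats: (n, D); base_means: (C, D); base_covs: (C, D, D)."""
    mu = few_shot_feats.mean(axis=0)

    # Weight the k nearest base classes by similarity to the few-shot mean.
    d = np.linalg.norm(base_means - mu, axis=1)
    idx = np.argsort(d)[:k]
    w = np.exp(-(d[idx] - d[idx].min()))  # shift for numerical stability
    w /= w.sum()

    # Mix base statistics with the few-shot estimate (0.5 is a made-up choice).
    mean = 0.5 * (w @ base_means[idx]) + 0.5 * mu
    cov = np.tensordot(w, base_covs[idx], axes=1) + 1e-6 * np.eye(mu.size)
    return np.random.multivariate_normal(mean, cov, size=n_samples)
```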
arXiv Detail & Related papers (2022-04-11T16:22:53Z)
- High-Robustness, Low-Transferability Fingerprinting of Neural Networks [78.2527498858308]
This paper proposes Characteristic Examples for effectively fingerprinting deep neural networks.
It features high robustness to pruning of the base model as well as low transferability to unassociated models.
arXiv Detail & Related papers (2021-05-14T21:48:23Z)
- Why Do Better Loss Functions Lead to Less Transferable Features? [93.47297944685114]
This paper studies how the choice of training objective affects the transferability of the hidden representations of convolutional neural networks trained on ImageNet.
We show that many objectives lead to statistically significant improvements in ImageNet accuracy over vanilla softmax cross-entropy, but the resulting fixed feature extractors transfer substantially worse to downstream tasks.
arXiv Detail & Related papers (2020-10-30T17:50:31Z)
- Grow-Push-Prune: aligning deep discriminants for effective structural network compression [5.532477732693]
This paper attempts to derive task-dependent compact models from a deep discriminant analysis perspective.
We propose an iterative and proactive approach for classification tasks which alternates between a pushing step and a pruning step.
Experiments on the MNIST, CIFAR10, and ImageNet datasets demonstrate our approach's efficacy.
arXiv Detail & Related papers (2020-09-29T01:29:23Z)
- Prevention is Better than Cure: Handling Basis Collapse and Transparency in Dense Networks [0.0]
We identify a basis collapse issue as a primary cause and propose a modified loss function that circumvents this problem.
We demonstrate through carefully chosen numerical experiments that the basis collapse issue leads to the design of massively redundant networks.
Our approach results in substantially more compact nets, having $100\times$ fewer parameters, while achieving a much lower ($10\times$) MSE loss at scale than reported in prior works.
arXiv Detail & Related papers (2020-08-22T17:09:54Z)