Nonparametric Teaching of Attention Learners
- URL: http://arxiv.org/abs/2602.20461v1
- Date: Tue, 24 Feb 2026 01:42:48 GMT
- Title: Nonparametric Teaching of Attention Learners
- Authors: Chen Zhang, Jianghui Wang, Bingyang Cheng, Zhongtao Chen, Wendong XU, Cong Wang, Marco Canini, Francesco Orabona, Yik Chung WU, Ngai Wong
- Abstract summary: We present a novel paradigm named Attention Neural Teaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, we observe training time reductions of 13.01% for LLMs and 20.58% for ViTs, spanning both fine-tuning and training-from-scratch regimes.
- Score: 37.60057002655994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention learners, neural networks built on the attention mechanism, e.g., transformers, excel at learning the implicit relationships that relate sequences to their corresponding properties, e.g., mapping a given sequence of tokens to the probability of the next token. However, the learning process tends to be costly. To address this, we present a novel paradigm named Attention Neural Teaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, the latter provides a theoretical framework for teaching mappings that are implicitly defined (i.e., nonparametric) via example selection. Such an implicit mapping is embodied through a dense set of sequence-property pairs, with the AtteNT teacher selecting a subset to accelerate convergence in attention learner training. By analytically investigating the role of attention on parameter-based gradient descent during training, and recasting the evolution of attention learners, shaped by parameter updates, through functional gradient descent in nonparametric teaching, we show for the first time that teaching attention learners is consistent with teaching importance-adaptive nonparametric learners. These new findings readily commit AtteNT to enhancing learning efficiency of attention learners. Specifically, we observe training time reductions of 13.01% for LLMs and 20.58% for ViTs, spanning both fine-tuning and training-from-scratch regimes. Crucially, these gains are achieved without compromising accuracy; in fact, performance is consistently preserved and often enhanced across a diverse set of downstream tasks.
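The abstract describes a teacher that selects a subset of sequence-property pairs to speed up convergence. The listing does not give the AtteNT selection rule, so the following is only a minimal sketch of the general idea of greedy example selection, under loud assumptions: a toy linear learner stands in for an attention learner, and the teacher simply picks the examples with the largest current residual. The names `select_examples` and `teach` are hypothetical and are not taken from the paper.

```python
import numpy as np


def select_examples(residuals, k):
    """Return indices of the k examples with the largest absolute residual.

    Assumption: "largest current error" is used as the selection criterion;
    this is an illustrative stand-in, not the rule used by AtteNT.
    """
    return np.argsort(-np.abs(residuals))[:k]


def teach(X, y, k, steps=200, lr=0.1):
    """Train a linear learner on a teacher-selected subset at every step."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residuals = X @ w - y                      # error on the full pool
        idx = select_examples(residuals, k)        # teacher picks a subset
        grad = X[idx].T @ residuals[idx] / k       # squared-loss gradient on the subset
        w -= lr * grad                             # learner update
    return w


# Toy, noise-free regression problem (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true

w = teach(X, y, k=32)
final_mse = float(np.mean((X @ w - y) ** 2))
print("final MSE on the full pool:", final_mse)
```

Each step here computes gradients on only 32 of the 256 examples, which is the kind of per-step saving that example selection aims for; whether that translates into the training-time reductions reported above depends on the actual selection rule and model, which this sketch does not reproduce.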
Related papers
- Spectral Imbalance Causes Forgetting in Low-Rank Continual Adaptation [58.3773038915023]
Continual learning aims to adapt pre-trained models to sequential tasks without forgetting previously acquired knowledge. Most existing approaches treat continual learning as avoiding interference with past updates, rather than considering what properties make the current task-specific update naturally preserve previously acquired knowledge. We address this problem using a projected first-order method compatible with standard deep-dots used in vision-language models.
arXiv Detail & Related papers (2026-01-31T13:27:02Z)
- EKPC: Elastic Knowledge Preservation and Compensation for Class-Incremental Learning [53.88000987041739]
Class-Incremental Learning (CIL) aims to enable AI models to continuously learn from sequentially arriving data of different classes over time. We propose the Elastic Knowledge Preservation and Compensation (EKPC) method, integrating Importance-aware Parameter Regularization (IPR) and Trainable Semantic Drift Compensation (TSDC) for CIL.
arXiv Detail & Related papers (2025-06-14T05:19:58Z)
- Nonparametric Teaching for Graph Property Learners [21.96981353343662]
We propose a paradigm called Graph Neural Teaching (GraNT) that reinterprets the learning process through a novel nonparametric teaching perspective. GraNT offers a theoretical framework for teaching implicitly defined (i.e., nonparametric) mappings via example selection. We show for the first time that teaching graph property learners is consistent with teaching structure-aware nonparametric learners.
arXiv Detail & Related papers (2025-05-20T10:23:30Z)
- On the Surprising Effectiveness of Attention Transfer for Vision Transformers [118.83572030360843]
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations.
We investigate whether this is actually the case and find that the features and representations learned during pre-training are not essential.
arXiv Detail & Related papers (2024-11-14T18:59:40Z)
- Dynamics of Supervised and Reinforcement Learning in the Non-Linear Perceptron [3.069335774032178]
We use a stochastic-process approach to derive flow equations describing learning. We characterize the effects of the learning rule (supervised or reinforcement learning, SL/RL) and input-data distribution on the perceptron's learning curve. This approach points a way toward analyzing learning dynamics for more complex circuit architectures.
arXiv Detail & Related papers (2024-09-05T17:58:28Z)
- Learning Continually by Spectral Regularization [45.55508032009977]
Continual learning algorithms seek to mitigate loss of plasticity by sustaining good performance while maintaining network trainability.
We develop a new technique for improving continual learning inspired by the observation that the singular values of the neural network parameters at initialization are an important factor for trainability during early phases of learning.
We present an experimental analysis that shows how the proposed spectral regularizer can sustain trainability and performance across a range of model architectures in continual supervised and reinforcement learning settings.
arXiv Detail & Related papers (2024-06-10T21:34:43Z)
- Nonparametric Teaching of Implicit Neural Representations [21.313485818701434]
We show for the first time that teaching an overparameterized multilayer perceptron is consistent with teaching a nonparametric learner.
This new discovery permits a convenient drop-in of nonparametric teaching algorithms to broadly enhance INR training efficiency, demonstrating 30%+ training time savings across various input modalities.
arXiv Detail & Related papers (2024-05-17T04:20:39Z)
- Nonparametric Teaching for Multiple Learners [20.75580803325611]
We introduce a novel framework -- Multi-learner Nonparametric Teaching (MINT).
MINT aims to instruct multiple learners, with each learner focusing on learning a scalar-valued target model.
We demonstrate that MINT offers significant teaching speed-up over repeated single-learner teaching.
arXiv Detail & Related papers (2023-11-17T04:04:11Z)
- TOAST: Transfer Learning via Attention Steering [77.83191769502763]
Current transfer learning methods often fail to focus on task-relevant features.
We introduce Top-Down Attention Steering (TOAST), a novel transfer learning algorithm that steers the attention to task-specific features.
TOAST substantially improves performance across a range of fine-grained visual classification datasets.
arXiv Detail & Related papers (2023-05-24T20:03:04Z)
- HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions [69.14257241250046]
We propose a new contrastive learning approach to train models for skeleton-based action recognition without labels.
Our key contribution is a simple module, HaLP - to Hallucinate Latent Positives for contrastive learning.
We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements.
arXiv Detail & Related papers (2023-04-01T21:09:43Z)
- A Message Passing Perspective on Learning Dynamics of Contrastive Learning [60.217972614379065]
We show that if we cast a contrastive objective equivalently into the feature space, then its learning dynamics admits an interpretable form.
This perspective also establishes an intriguing connection between contrastive learning and Message Passing Graph Neural Networks (MP-GNNs).
arXiv Detail & Related papers (2023-03-08T08:27:31Z)
- Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning [43.504548777955854]
We study how contrastive learning learns the feature representations for neural networks by analyzing its feature learning process.
We prove that contrastive learning using ReLU networks provably learns the desired sparse features if proper augmentations are adopted.
arXiv Detail & Related papers (2021-05-31T16:42:09Z)