Transformer Learns Optimal Variable Selection in Group-Sparse Classification
- URL: http://arxiv.org/abs/2504.08638v1
- Date: Fri, 11 Apr 2025 15:39:44 GMT
- Title: Transformer Learns Optimal Variable Selection in Group-Sparse Classification
- Authors: Chenyang Zhang, Xuran Meng, Yuan Cao
- Abstract summary: We give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity". We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples.
- Score: 14.760685658938787
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers have demonstrated remarkable success across various applications. However, the success of transformers has not been understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups, and the label only depends on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.
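As a rough illustration of the setting described in the abstract, the sketch below generates group-sparse classification data (the label depends only on the variables in one group) and pools the group-tokens with a single softmax attention head. The dimensions, the Gaussian design, the linear label rule within the informative group, and the untrained query vector are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Group-sparse classification data: the variables are split into G groups of size m;
# only the variables in group g_star carry the signal that determines the label.
n, G, m = 200, 5, 4          # samples, number of groups, group size (illustrative sizes)
g_star = 2                   # index of the informative group (hypothetical choice)
w = rng.normal(size=m)       # signal direction within the informative group

X = rng.normal(size=(n, G, m))            # each sample is a sequence of G group-tokens
y = np.sign(X[:, g_star, :] @ w)          # label depends only on group g_star

# One-layer attention "variable selection": a query vector scores each group-token;
# after training, the softmax weights should concentrate on the informative group.
q = rng.normal(size=m)                    # untrained query, shown only for shape
scores = X @ q                            # (n, G) attention logits, one per group
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
pooled = (attn[..., None] * X).sum(axis=1)  # attention-weighted mix of group-tokens
print(pooled.shape, y.shape)              # (n, m) features fed to a linear classifier
```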
Related papers
- Learning Spectral Methods by Transformers [18.869174453242383]
We show that multi-layered Transformers, given a sufficiently large set of pre-training instances, are able to learn the algorithms themselves.
This learning paradigm is distinct from the in-context learning setup and is similar to the learning procedure of human brains.
arXiv Detail & Related papers (2025-01-02T15:53:25Z) - One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [48.4979348643494]
We study the capability of one-layer transformers to learn the one-nearest neighbor rule.
A single softmax attention layer can successfully learn to behave like a one-nearest neighbor predictor.
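A hedged illustration of this idea: with keys and queries set to the (unit-norm) inputs, labels as values, and a large inverse temperature, a single softmax attention head approximates one-nearest-neighbor prediction. The construction below is hand-set for illustration, not the trained solution from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# In-context 1-NN via one softmax attention head (illustrative construction).
d, n_ctx = 8, 32
X = rng.normal(size=(n_ctx, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit norm: max dot product == nearest neighbor
y = rng.choice([-1.0, 1.0], size=n_ctx)
x_query = X[rng.integers(n_ctx)] + 0.01 * rng.normal(size=d)

beta = 50.0                                    # large inverse temperature sharpens the softmax
logits = beta * (X @ x_query)
attn = np.exp(logits - logits.max())
attn /= attn.sum()
pred = attn @ y                                # ~ label of the nearest context example

nn_label = y[np.argmax(X @ x_query)]
print(pred, nn_label)                          # pred should be close to nn_label
```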
arXiv Detail & Related papers (2024-11-16T16:12:42Z) - Adversarial Robustness of In-Context Learning in Transformers for Linear Regression [23.737606860443705]
This work investigates the vulnerability of in-context learning in transformers to hijacking attacks, focusing on the setting of linear regression tasks.
We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust and can be manipulated to output arbitrary predictions.
We then demonstrate that adversarial training enhances transformers' robustness against hijacking attacks, even when just applied during finetuning.
arXiv Detail & Related papers (2024-11-07T21:25:58Z) - On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures [20.980349268151546]
This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate.
arXiv Detail & Related papers (2024-10-15T16:57:14Z) - Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context [25.360386832940875]
We show that when linear transformers are pre-trained on random instances for linear regression tasks, they make predictions using an algorithm similar to that of ordinary least squares. In some settings, these trained transformers can exhibit "benign overfitting in-context".
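For reference, the ordinary least squares baseline that these pretrained linear transformers are reported to approximate can be computed directly from the in-context examples. The sketch below shows only this baseline (with illustrative dimensions and noise level), not the transformer itself.

```python
import numpy as np

rng = np.random.default_rng(3)

# The OLS predictor fit only on the in-context examples, then applied to the query.
d, n_ctx = 5, 40
w_true = rng.normal(size=d)
X = rng.normal(size=(n_ctx, d))
y = X @ w_true + 0.1 * rng.normal(size=n_ctx)
x_query = rng.normal(size=d)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit from the context
print(x_query @ w_ols, x_query @ w_true)       # in-context prediction vs ground truth
```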
arXiv Detail & Related papers (2024-10-02T17:30:21Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient training and inference algorithms based on low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches; this is the first time such a model has done so.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., learn models by gradient descent in their forward pass.
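A minimal sketch of this mesa-optimization idea, under the standard construction in which linear attention with hand-set weights reproduces one gradient-descent step on an in-context linear regression loss (the step size, dimensions, and data below are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)

# One GD step on in-context linear regression, written as a linear-attention computation.
d, n_ctx, eta = 5, 20, 0.05
w_true = rng.normal(size=d)
X = rng.normal(size=(n_ctx, d))
y = X @ w_true
x_query = rng.normal(size=d)

# Gradient-descent view: one step from w = 0 on 0.5 * sum_i (w @ x_i - y_i)^2
w_gd = eta * (X.T @ y)
pred_gd = x_query @ w_gd

# Linear-attention view: query x_query, keys x_i, values y_i, no softmax
pred_attn = eta * np.sum((X @ x_query) * y)

print(np.isclose(pred_gd, pred_attn))          # the two computations coincide
```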
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales and share parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z) - The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)