A Constrained Optimization Perspective of Unrolled Transformers
- URL: http://arxiv.org/abs/2601.17257v1
- Date: Sat, 24 Jan 2026 02:12:39 GMT
- Title: A Constrained Optimization Perspective of Unrolled Transformers
- Authors: Javier Porras-Valenzuela, Samar Hadou, Alejandro Ribeiro
- Abstract summary: We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms. We observe that constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization.
- Score: 77.12297732942095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms. Specifically, we enforce layerwise descent constraints on the objective function and replace standard empirical risk minimization (ERM) with a primal-dual training scheme. This approach yields models whose intermediate representations decrease the loss monotonically in expectation across layers. We apply our method to both unrolled transformer architectures and conventional pretrained transformers on tasks of video denoising and text classification. Across these settings, we observe constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization, while preserving in-distribution performance.
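The primal-dual scheme with layerwise descent constraints described in the abstract can be illustrated with a rough toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the scalar quadratic objective, the "layers" that are gradient steps with learnable step sizes `theta`, the constraint form F_l <= (1 - EPS) * F_{l-1} on mean layer losses, the finite-difference gradient, and all hyperparameter values. It only shows the general mechanism of gradient descent on the model parameters combined with projected gradient ascent on the dual variables.

```python
# Toy sketch of primal-dual training with layerwise descent constraints.
# All names and hyperparameters are illustrative assumptions, not from the paper.

YS = [1.0, -2.0, 0.5, 3.0]   # toy targets; sample loss is f(x) = 0.5 * (x - y)^2
EPS = 0.2                    # assumed required relative decrease per layer


def layer_losses(theta):
    """Mean loss after each layer of a toy unrolled model with x_0 = 0.

    Layer l applies x_l = x_{l-1} - theta[l] * (x_{l-1} - y), i.e. a gradient
    step on f with a learnable step size. Returns [F_0, F_1, ..., F_L].
    """
    xs = [0.0] * len(YS)
    losses = [sum(0.5 * (x - y) ** 2 for x, y in zip(xs, YS)) / len(YS)]
    for step in theta:
        xs = [x - step * (x - y) for x, y in zip(xs, YS)]
        losses.append(sum(0.5 * (x - y) ** 2 for x, y in zip(xs, YS)) / len(YS))
    return losses


def slacks(theta):
    """Constraint slack per layer: F_l - (1 - EPS) * F_{l-1}; feasible when <= 0."""
    f = layer_losses(theta)
    return [f[l] - (1.0 - EPS) * f[l - 1] for l in range(1, len(f))]


def lagrangian(theta, lam):
    """Final-layer loss plus dual-weighted constraint slacks."""
    return layer_losses(theta)[-1] + sum(m * s for m, s in zip(lam, slacks(theta)))


def grad_theta(theta, lam, h=1e-4):
    """Central finite-difference gradient of the Lagrangian w.r.t. theta."""
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + h] + theta[k + 1:]
        down = theta[:k] + [theta[k] - h] + theta[k + 1:]
        grad.append((lagrangian(up, lam) - lagrangian(down, lam)) / (2 * h))
    return grad


def train(theta, steps=200, lr_primal=0.05, lr_dual=0.1):
    """Gradient descent on theta and projected gradient ascent on lam,
    both computed from the current iterates at every step."""
    theta, lam = list(theta), [0.0] * len(theta)
    for _ in range(steps):
        s = slacks(theta)
        g = grad_theta(theta, lam)
        theta = [t - lr_primal * gk for t, gk in zip(theta, g)]
        lam = [max(0.0, m + lr_dual * sk) for m, sk in zip(lam, s)]  # keep lam >= 0
    return theta, lam


# Starting point: only the last layer does any work, so the final loss is already
# about zero and minimizing the final loss alone would leave layers 1 and 2 idle.
theta_init = [0.0, 0.0, 1.0]
theta_final, lam_final = train(theta_init)
print("layer losses before:", layer_losses(theta_init))
print("layer losses after: ", layer_losses(theta_final))
```

In this toy setup the starting point already minimizes the final-layer loss, so any change to the first two layers comes from the dual variables, which grow while a descent constraint is violated and shrink back toward zero once it is met.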
Related papers
- Variational Entropic Optimal Transport [67.76725267984578]
We propose Variational Entropic Optimal Transport (VarEOT) for domain translation problems. VarEOT is based on an exact variational reformulation of the log-partition $\log \mathbb{E}[\exp(\cdot)]$ as a tractable generalization over an auxiliary positive normalizer. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality.
arXiv Detail & Related papers (2026-02-02T15:48:44Z)
- Effective Fine-Tuning of Vision Transformers with Low-Rank Adaptation for Privacy-Preserving Image Classification [5.311735227179715]
We propose a low-rank adaptation method for training privacy-preserving vision transformer (ViT) models that efficiently freezes pre-trained ViT model weights. The proposed method allows us not only to reduce the number of trainable parameters but also to maintain almost the same accuracy as full fine-tuning.
arXiv Detail & Related papers (2025-07-16T06:18:52Z)
- OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization [1.7180235064112577]
We consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model.
arXiv Detail & Related papers (2025-01-30T22:52:40Z)
- Transformers Handle Endogeneity in In-Context Linear Regression [34.458004744956334]
We show that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). We propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss.
arXiv Detail & Related papers (2024-10-02T06:21:04Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
- Optimizing a Transformer-based network for a deep learning seismic processing workflow [0.0]
StorSeismic is a recently introduced model based on the Transformer to adapt to various seismic processing tasks.
We observe faster pretraining and competitive results on the fine-tuning tasks and, additionally, fewer parameters to train compared to the vanilla model.
arXiv Detail & Related papers (2023-08-09T07:11:42Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.