Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior
- URL: http://arxiv.org/abs/2510.16356v1
- Date: Sat, 18 Oct 2025 05:26:13 GMT
- Title: Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior
- Authors: Fuqun Han, Stanley Osher, Wuchen Li
- Abstract summary: We propose a sparse transformer architecture that incorporates prior information about the underlying data distribution directly into the transformer structure of the neural network. We demonstrate that the sparse transformer achieves higher accuracy and faster convergence to the target distribution than classical neural ODE-based methods.
- Score: 0.49193859756091124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a sparse transformer architecture that incorporates prior information about the underlying data distribution directly into the transformer structure of the neural network. The design of the model is motivated by a special optimal transport problem, namely the regularized Wasserstein proximal operator, which admits a closed-form solution and turns out to be a special representation of transformer architectures. Compared with classical flow-based models, the proposed approach improves the convexity properties of the optimization problem and promotes sparsity in the generated samples. Through both theoretical analysis and numerical experiments, including applications in generative modeling and Bayesian inverse problems, we demonstrate that the sparse transformer achieves higher accuracy and faster convergence to the target distribution than classical neural ODE-based methods.
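The closed-form solution referenced in the abstract can be made concrete: in the authors' related kernel-formula work, the regularized Wasserstein proximal operator acts on a density through an integral kernel whose normalization is exactly a softmax, which is the transformer-like structure the abstract alludes to. Below is a minimal 1D NumPy sketch of that kernel with an L1 potential V(x) = lam * |x|; the sign/scaling conventions, parameter values, and function names are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal 1D sketch of the softmax (attention-like) kernel form of the
# regularized Wasserstein proximal operator with an L1 prior. The exact
# sign/scaling conventions and all parameter values are assumptions.
import numpy as np

def l1_potential(x, lam=1.0):
    """L1 prior potential V(x) = lam * |x|, which promotes sparsity."""
    return lam * np.abs(x)

def regularized_wasserstein_proximal(y_samples, x_grid, T=0.5, beta=2.0, lam=1.0):
    """Approximate rho_T(x) = integral of K(x, y) rho_0(y) dy on a grid.

    Assumed kernel form:
        K(x, y) = exp(-beta * (V(x) + (x - y)^2 / (2 T)))
                  / integral_z exp(-beta * (V(z) + (z - y)^2 / (2 T))) dz
    The per-column normalization over z is a softmax: the transformer-like
    structure the abstract refers to.
    """
    X = x_grid[:, None]        # evaluation points, shape (M, 1)
    Y = y_samples[None, :]     # samples from the base density rho_0, shape (1, N)
    scores = -beta * (l1_potential(X, lam) + (X - Y) ** 2 / (2 * T))
    scores -= scores.max(axis=0, keepdims=True)      # numerical stability
    K = np.exp(scores)
    dz = x_grid[1] - x_grid[0]
    K /= K.sum(axis=0, keepdims=True) * dz           # normalize each column over z
    return K.mean(axis=1)      # Monte Carlo average over base samples

rng = np.random.default_rng(0)
y = rng.normal(size=2000)                # base distribution: standard Gaussian
x = np.linspace(-4.0, 4.0, 801)
rho_T = regularized_wasserstein_proximal(y, x)
print("total mass:", rho_T.sum() * (x[1] - x[0]))   # ~1 by construction
```

Each column of K is normalized over the evaluation grid, so the operator is a softmax-weighted average over base samples; the exp(-beta * lam * |x|) tilt in the numerator is what concentrates the transported density near zero, i.e., promotes sparsity.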
Related papers
- Order-Optimal Sample Complexity of Rectified Flows [43.61958734990224]
We study rectified flow models, which constrain transport trajectories to be linear from the base distribution to the data distribution. This structural restriction greatly accelerates sampling, often enabling high-quality generation with a single step (a minimal sketch of the construction appears after this list).
arXiv Detail & Related papers (2026-01-28T04:55:14Z)
- WUSH: Near-Optimal Adaptive Transforms for LLM Quantization [52.77441224845925]
Quantization to low bitwidth is a standard approach for deploying large language models. A few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer (a small numerical illustration of this effect appears after this list). We derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization.
arXiv Detail & Related papers (2025-11-30T16:17:34Z)
- Neural Optimal Transport Meets Multivariate Conformal Prediction [58.43397908730771]
We propose a framework for conditional vector quantile regression (CVQR). CVQR combines neural optimal transport with quantile-based optimization and applies it to multivariate conformal prediction.
arXiv Detail & Related papers (2025-09-29T19:50:19Z)
- A surrogate model for topology optimisation of elastic structures via parametric autoencoders [0.0]
Instead of learning the parametric solution of the state (and adjoint) problems, the proposed approach devises a surrogate version of the entire optimisation pipeline. The method predicts a quasi-optimal topology for a given problem configuration as a surrogate model of high-fidelity topologies optimised with the homogenisation method. Different architectures are proposed and the approximation and generalisation capabilities of the resulting models are numerically evaluated.
arXiv Detail & Related papers (2025-07-30T10:07:42Z)
- Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures [1.9567015559455132]
We present a framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM). Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture.
arXiv Detail & Related papers (2025-05-01T19:19:29Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, perform well for learning with Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving efficiency.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Approximation Rate of the Transformer Architecture for Sequence Modeling [18.166959969957315]
We consider a class of non-linear relationships and identify a novel notion of complexity measure to establish an explicit Jackson-type approximation rate estimate for the Transformer. This rate reveals the structural properties of the Transformer and suggests the types of sequential relationships it is best suited to approximate.
arXiv Detail & Related papers (2023-05-29T10:56:36Z)
- End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
- VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator implies that the latent space shows an unsatisfactory projection of the data space, which results in poor representation learning.
We show that exploiting geodesics, computed accurately on the learned latent manifold, can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
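For the "Order-Optimal Sample Complexity of Rectified Flows" entry above, the linear-trajectory construction can be sketched directly: draw pairs (x0, x1) from the base and data distributions, interpolate x_t = (1 - t) x0 + t x1, and regress a velocity field onto the constant displacement x1 - x0. The toy data and the linear-in-features velocity model below are illustrative assumptions; note that the near-single-step sampling mentioned in the summary comes from straightening (reflowing) the coupling, which this sketch omits.

```python
# Sketch of the rectified-flow construction: linear interpolation paths
# x_t = (1 - t) x0 + t x1 and regression onto the displacement x1 - x0.
# Toy data and the simple feature model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 4096
x0 = rng.normal(size=(n, 1))                 # base: standard Gaussian
x1 = rng.choice([-2.0, 2.0], size=(n, 1))    # toy two-mode "data"
t = rng.uniform(size=(n, 1))
xt = (1 - t) * x0 + t * x1                   # linear transport trajectory

def features(x, t):
    # Crude linear-in-features velocity model v(x, t); a neural network
    # would normally play this role.
    return np.concatenate([x, t, x * t, np.ones_like(x)], axis=1)

w, *_ = np.linalg.lstsq(features(xt, t), x1 - x0, rcond=None)

# Euler-integrate the learned field from t=0 to t=1.
z = rng.normal(size=(8, 1))
x_cur, n_steps = z.copy(), 50
for k in range(n_steps):
    tk = np.full_like(x_cur, k / n_steps)
    x_cur = x_cur + features(x_cur, tk) @ w / n_steps
print(x_cur.ravel())   # samples rescaled toward the data's spread; this
                       # linear field cannot split modes, a richer model would
```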
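And for the WUSH entry, the claim that a few extreme weights stretch the dynamic range and reduce the effective resolution of the quantizer is easy to verify numerically. The plain absmax int4 quantizer below is a generic baseline chosen only to illustrate that effect; it is not WUSH's closed-form transform.

```python
# One outlier stretches the dynamic range of absmax quantization and
# inflates the error on all typical weights in the block. Generic
# baseline quantizer for illustration only (not WUSH's transform).
import numpy as np

def absmax_quantize(w, bits=4):
    """Symmetric absmax quantizer: scale by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=127)          # typical small weights
for outlier in (None, 0.5):
    w = block if outlier is None else np.append(block, outlier)
    err = np.abs(absmax_quantize(w)[: len(block)] - block).mean()
    print(f"outlier={outlier}: mean |error| on typical weights = {err:.5f}")
```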