Transformer Neural Autoregressive Flows
- URL: http://arxiv.org/abs/2401.01855v1
- Date: Wed, 3 Jan 2024 17:51:16 GMT
- Title: Transformer Neural Autoregressive Flows
- Authors: Massimiliano Patacchiola, Aliaksandra Shysheya, Katja Hofmann, Richard
E. Turner
- Abstract summary: Density estimation can be performed using Normalizing Flows (NFs).
We propose a novel solution by exploiting transformers to define a new class of neural flows called Transformer Neural Autoregressive Flows (T-NAFs).
- Score: 48.68932811531102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Density estimation, a central problem in machine learning, can be performed
using Normalizing Flows (NFs). NFs comprise a sequence of invertible
transformations that turn a complex target distribution into a simple one by
exploiting the change of variables theorem. Neural Autoregressive Flows (NAFs)
and Block Neural Autoregressive Flows (B-NAFs) are arguably the most performant
members of the NF family. However, they suffer from scalability issues and training
instability due to the constraints imposed on the network structure. In this
paper, we propose a novel solution to these challenges by exploiting
transformers to define a new class of neural flows called Transformer Neural
Autoregressive Flows (T-NAFs). T-NAFs treat each dimension of a random variable
as a separate input token, using attention masking to enforce an autoregressive
constraint. We take an amortization-inspired approach where the transformer
outputs the parameters of an invertible transformation. The experimental
results demonstrate that T-NAFs consistently match or outperform NAFs and
B-NAFs across multiple datasets from the UCI benchmark. Remarkably, T-NAFs
achieve these results using an order of magnitude fewer parameters than
previous approaches, without composing multiple flows.
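As a brief restatement in standard flow notation (not quoted from the paper): for an invertible map $z = f(x)$ with base density $p_Z$, the change of variables theorem gives $\log p_X(x) = \log p_Z(f(x)) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|$. An autoregressive flow constrains each output to $z_i = f_i(x_i; \theta_i(x_{<i}))$, so the Jacobian is triangular and the log-determinant reduces to $\sum_{i=1}^{D} \log\left|\partial z_i / \partial x_i\right|$. Per the abstract, T-NAFs use an attention-masked transformer to produce the parameters $\theta_i$ from the preceding dimensions.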
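Below is a minimal, hedged sketch in PyTorch of the mechanism described in the abstract: each dimension of x becomes one input token, a causal attention mask together with a one-step input shift enforces the autoregressive constraint, and the transformer outputs the parameters of a per-dimension invertible map. The class name, the learned start token, and the affine transformation are illustrative assumptions rather than the authors' implementation; the paper's invertible transformation is more expressive, but an affine map keeps the change-of-variables term easy to read.

```python
# Hedged sketch of a T-NAF-style density estimator (assumptions noted above).
import math
import torch
import torch.nn as nn


class TNAFSketch(nn.Module):
    """Each dimension of x is one token; a causal mask plus a one-step input
    shift keeps the parameters for dimension i a function of x_{<i} only."""

    def __init__(self, dim: int, d_model: int = 64, nhead: int = 4, nlayers: int = 2):
        super().__init__()
        self.dim = dim
        self.embed = nn.Linear(1, d_model)                      # scalar dimension -> token
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))   # token feeding dimension 1
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
        self.to_params = nn.Linear(d_model, 2)                  # (log_scale, shift) per dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Return log p(x) under a standard-normal base density."""
        B, D = x.shape
        tokens = self.embed(x.unsqueeze(-1))                                 # (B, D, d_model)
        # Shift right so the parameters for dimension i never see x_i itself.
        tokens = torch.cat([self.start.expand(B, 1, -1), tokens[:, :-1]], dim=1)
        # Boolean mask: True above the diagonal blocks attention to future tokens.
        causal = torch.triu(torch.ones(D, D, dtype=torch.bool), diagonal=1)
        h = self.encoder(tokens, mask=causal)                                # (B, D, d_model)
        log_scale, shift = self.to_params(h).unbind(-1)                      # (B, D) each
        z = x * torch.exp(log_scale) + shift                                 # invertible per-dim map
        # Change of variables: log p(x) = log N(z; 0, I) + sum_i log |dz_i/dx_i|
        log_base = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(-1)
        return log_base + log_scale.sum(-1)


if __name__ == "__main__":
    model = TNAFSketch(dim=8)
    x = torch.randn(16, 8)
    print(model(x).shape)  # torch.Size([16]) of log-densities
```

Because the Jacobian of this map is triangular with diagonal entries exp(log_scale_i), the log-determinant is simply the sum of the predicted log-scales, which is what the last line of forward adds to the base log-density.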
Related papers
- Equivariant Neural Functional Networks for Transformers [2.3963215252605172]
This paper systematically explores neural functional networks (NFN) for transformer architectures.
NFNs are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data.
arXiv Detail & Related papers (2024-10-05T15:56:57Z) - Entropy-Informed Weighting Channel Normalizing Flow [7.751853409569806]
We propose a regularized and feature-dependent $\mathtt{Shuffle}$ operation and integrate it into a vanilla multi-scale architecture.
We observe that this operation guides the variables to evolve in the direction of entropy increase, hence we refer to NFs with the $\mathtt{Shuffle}$ operation as Entropy-Informed Weighting Channel Normalizing Flow (EIW-Flow).
arXiv Detail & Related papers (2024-07-06T04:46:41Z) - Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot [50.16171384920963]
The transformer architecture has prevailed in various deep learning settings.
A one-layer transformer trained with gradient descent provably learns the sparse token selection task.
arXiv Detail & Related papers (2024-06-11T02:15:53Z) - Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z) - Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z) - Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in the frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - Factorized Fourier Neural Operators [77.47313102926017]
The Factorized Fourier Neural Operator (F-FNO) is a learning-based method for simulating partial differential equations.
We show that our model maintains an error rate of 2% while still running an order of magnitude faster than a numerical solver.
arXiv Detail & Related papers (2021-11-27T03:34:13Z) - Learning Likelihoods with Conditional Normalizing Flows [54.60456010771409]
Conditional normalizing flows (CNFs) are efficient in sampling and inference.
We present a study of CNFs where the base density to output space mapping is conditioned on an input x, to model conditional densities p(y|x).
arXiv Detail & Related papers (2019-11-29T19:17:58Z)