Fourier Head: Helping Large Language Models Learn Complex Probability Distributions
- URL: http://arxiv.org/abs/2410.22269v1
- Date: Tue, 29 Oct 2024 17:27:58 GMT
- Title: Fourier Head: Helping Large Language Models Learn Complex Probability Distributions
- Authors: Nate Gillman, Daksh Aggarwal, Michael Freeman, Saurabh Singh, Chen Sun
- Abstract summary: We introduce a neural network layer, constructed using Fourier series, which we can easily substitute for any linear layer if we want the outputs to have a more continuous structure.
We perform extensive analysis on synthetic datasets, as well as on large-scale decision making and time series forecasting tasks.
All of our results support the effectiveness of our proposed Fourier head in scenarios where the underlying data distribution has a natural continuous structure.
- Score: 7.074506869260538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the quality of large language models has improved, there has been increased interest in using them to model non-linguistic tokens. For example, the Decision Transformer recasts agentic decision making as a sequence modeling problem, using a decoder-only LLM to model the distribution over the discrete action space for an Atari agent. However, when adapting LLMs to non-linguistic domains, it remains unclear if softmax over discrete bins captures the continuous structure of the tokens and the potentially complex distributions needed for high quality token generation. We introduce a neural network layer, constructed using Fourier series, which we can easily substitute for any linear layer if we want the outputs to have a more continuous structure. We perform extensive analysis on synthetic datasets, as well as on large-scale decision making and time series forecasting tasks. We also provide theoretical evidence that this layer can better learn signal from data while ignoring high-frequency noise. All of our results support the effectiveness of our proposed Fourier head in scenarios where the underlying data distribution has a natural continuous structure. For example, the Fourier head improves a Decision Transformer agent's returns by 46% on the Atari Seaquest game, and increases a state-of-the-art time series foundation model's forecasting performance by 3.5% across 20 benchmarks unseen during training.
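The abstract does not spell out how the layer is built, so the following is only a minimal sketch of one way a Fourier-series output head could be set up, not the authors' implementation. It assumes a hypothetical upstream linear layer has already produced complex parameters `a_0..a_N`; the function name `fourier_head_probs`, the choice of the interval [-1, 1], and the bin-center evaluation are all illustrative assumptions. The autocorrelation step is used here because it makes the resulting truncated Fourier series non-negative, so it can be normalized into a categorical distribution over bins.

```python
import cmath
import math


def fourier_head_probs(coeffs, num_bins):
    """Map complex parameters a_0..a_N to probabilities over num_bins bins on [-1, 1].

    Assumptions (illustrative, not taken from the paper): coeffs is non-zero,
    and len(coeffs) - 1 < num_bins so the bin-center sum stays positive.
    """
    n = len(coeffs) - 1
    # Autocorrelation c_k = sum_l a_{l+k} * conj(a_l).
    # The series built from c_k equals |sum_l a_l e^{i*pi*l*z}|^2 up to scaling,
    # which is why the density below cannot go negative.
    c = [
        sum(coeffs[l + k] * coeffs[l].conjugate() for l in range(n + 1 - k))
        for k in range(n + 1)
    ]
    c0 = c[0].real  # equals sum_l |a_l|^2
    # Evaluate the density at the center of each bin.
    centers = [-1 + (2 * j + 1) / num_bins for j in range(num_bins)]
    density = [
        0.5
        + sum(
            (c[k] / c0 * cmath.exp(1j * math.pi * k * z)).real
            for k in range(1, n + 1)
        )
        for z in centers
    ]
    # Normalize the sampled density into a categorical distribution.
    total = sum(density)
    return [d / total for d in density]


if __name__ == "__main__":
    # Three hypothetical complex parameters, eight output bins.
    print(fourier_head_probs([1 + 0j, 0.5 - 0.25j, 0.1j], 8))
```

Using fewer frequencies (a smaller `N`) yields smoother distributions over neighboring bins, which is the kind of continuous structure the abstract says a plain softmax over discrete bins may fail to capture.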
Related papers
- Output Scaling: YingLong-Delayed Chain of Thought in a Large Pretrained Time Series Forecasting Model [55.25659103706409]
This framework achieves state-of-the-art performance for our designed foundation model, YingLong. YingLong is a non-causal, bidirectional attention encoder-only transformer trained through masked token recovery. We release four foundation models ranging from 6M to 300M parameters, demonstrating superior results in zero-shot tasks.
arXiv Detail & Related papers (2025-05-20T14:31:06Z) - OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain [24.24834151329251]
OLinear is a linear model that operates in an orthogonally transformed domain. We introduce a customized linear layer, NormLin, which employs a normalized weight matrix to capture multivariate dependencies. Experiments on 24 benchmarks and 140 forecasting tasks demonstrate that OLinear consistently achieves state-of-the-art performance with high efficiency.
arXiv Detail & Related papers (2025-05-12T10:39:37Z) - State Fourier Diffusion Language Model (SFDLM): A Scalable, Novel Iterative Approach to Language Modeling [0.0]
This paper introduces a fully diffusion driven discrete text generation model built without any transformer or large convolution modules.
By composing local state space updates with global Fourier based mixing, the approach effectively captures both short and long range dependencies.
arXiv Detail & Related papers (2025-03-16T02:17:40Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - Sampling Foundational Transformer: A Theoretical Perspective [12.7600763629179]
We propose Sampling Foundational Transformer (SFT) that can work on multiple data modalities.
SFT has achieved competitive results on many benchmarks, while being faster in inference, compared to other very specialized models.
arXiv Detail & Related papers (2024-08-11T16:53:09Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels.
We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
arXiv Detail & Related papers (2024-07-01T15:43:25Z) - Inferring Data Preconditions from Deep Learning Models for Trustworthy
Prediction in Deployment [25.527665632625627]
It is important to reason about the trustworthiness of the model's predictions with unseen data during deployment.
Existing methods for specifying and verifying traditional software are insufficient for this task.
We propose a novel technique that uses rules derived from neural network computations to infer data preconditions.
arXiv Detail & Related papers (2024-01-26T03:47:18Z) - A Transformer-based Framework For Multi-variate Time Series: A Remaining Useful Life Prediction Use Case [4.0466311968093365]
This work proposed an encoder-transformer architecture-based framework for time series prediction.
We validated the effectiveness of the proposed framework on all four sets of the C-MAPSS benchmark dataset.
To make the model aware of the initial stages of machine life and its degradation path, a novel expanding window method was proposed.
arXiv Detail & Related papers (2023-08-19T02:30:35Z) - Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z) - Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce the class of efficient Transformers named Regularized Transformers (Reguformers).
The focus in our experiments is on oil&gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z) - Generative Text Modeling through Short Run Inference [47.73892773331617]
The present work proposes a short run dynamics for inference. It is initialized from the prior distribution of the latent variable and then runs a small number of Langevin dynamics steps guided by its posterior distribution.
We show that the models trained with short run dynamics more accurately model the data, compared to strong language model and VAE baselines, and exhibit no sign of posterior collapse.
arXiv Detail & Related papers (2021-05-27T09:14:35Z) - Generalizing Variational Autoencoders with Hierarchical Empirical Bayes [6.273154057349038]
We present Hierarchical Empirical Bayes Autoencoder (HEBAE), a computationally stable framework for probabilistic generative models.
Our key contributions are two-fold. First, we make gains by placing a hierarchical prior over the encoding distribution, enabling us to adaptively balance the trade-off between minimizing the reconstruction loss function and avoiding over-regularization.
arXiv Detail & Related papers (2020-07-20T18:18:39Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.