Wavelet GPT: Wavelet Inspired Large Language Models
- URL: http://arxiv.org/abs/2409.12924v4
- Date: Sun, 09 Feb 2025 23:09:31 GMT
- Title: Wavelet GPT: Wavelet Inspired Large Language Models
- Authors: Prateek Verma
- Abstract summary: Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements.
This paper infuses LLMs with a traditional signal processing idea, namely wavelets, during pre-training to take advantage of the structure.
We achieve the same pre-training performance almost twice as fast in text, audio, and images.
- Score: 1.2328446298523066
- Abstract: Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure. This paper infuses LLMs with a traditional signal processing idea, namely wavelets, during pre-training to take advantage of that structure. Without adding **any extra parameters** to a GPT-style LLM architecture in an academic setup, we achieve the same pre-training performance almost twice as fast in text, audio, and images. This is done by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, comparable to pre-training a larger neural architecture. Further, we show this extends to the Long Range Arena benchmark and several input representations such as characters, BPE tokens, bytes, waveform, math expressions, and image pixels. Our architecture allows every next-token prediction access to intermediate embeddings at different temporal resolutions in every decoder block. We hope this will pave the way for incorporating multi-rate signal processing into pre-training.
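The abstract does not spell out the exact wavelet construction, but the core idea, giving different embedding coordinates different temporal resolutions with no extra parameters, can be sketched in PyTorch roughly as follows (the function name, dyadic window sizes, and slice assignment are illustrative assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def multiscale_embeddings(x: torch.Tensor, levels: int = 3) -> torch.Tensor:
    """Hypothetical Haar-like structuring of intermediate embeddings.

    x: (batch, seq, dim). Successive slices of the embedding are
    overwritten with causal moving averages over windows of 2, 4, 8, ...
    tokens, so later coordinates vary more slowly (coarser temporal
    resolution). No parameters are added.
    """
    b, t, d = x.shape
    out = x.clone()
    slice_w = d // (levels + 1)          # coordinates assigned per resolution
    xc = x.transpose(1, 2)               # (b, d, t) for 1-D pooling
    for lvl in range(1, levels + 1):
        k = 2 ** lvl                     # dyadic window at this level
        padded = F.pad(xc, (k - 1, 0))   # left-pad only: strictly causal
        smooth = F.avg_pool1d(padded, kernel_size=k, stride=1)  # (b, d, t)
        lo, hi = lvl * slice_w, (lvl + 1) * slice_w
        out[:, :, lo:hi] = smooth.transpose(1, 2)[:, :, lo:hi]
    return out
```

Because each window only looks backward, this kind of operation preserves the causal masking that next-token prediction requires.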
Related papers
- Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison [27.44915531637358]
We compare the performance of dense feature prepending (DFP) with that of a cross-attention architecture.
Despite the wide adoption of DFP, our results do not indicate a clear advantage of DFP over cross-attention.
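For readers unfamiliar with the two designs being compared, a minimal sketch of the difference (shapes and module choices are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: speech features (b, S, d), text embeddings (b, T, d).
speech = torch.randn(2, 50, 512)
text = torch.randn(2, 10, 512)

# Dense feature prepending (DFP): speech features become a prefix of the
# decoder input and are handled by ordinary self-attention.
dfp_input = torch.cat([speech, text], dim=1)            # (b, S + T, d)

# Cross-attention: text queries attend to speech keys/values in a
# dedicated block instead of sharing the self-attention stream.
xattn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
fused, _ = xattn(query=text, key=speech, value=speech)  # (b, T, d)
```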
arXiv Detail & Related papers (2025-01-04T20:14:16Z)
- Whisper-GPT: A Hybrid Representation Audio Large Language Model [1.2328446298523066]
We propose a generative large language model (LLM) for speech and music that works with continuous audio representations and discrete tokens simultaneously within a single architecture.
We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.
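The abstract does not say how the continuous and discrete streams are combined; one plausible fusion, summing a token embedding with a projected continuous feature per position, might look like this sketch (class and argument names are hypothetical):

```python
import torch
import torch.nn as nn

class HybridInput(nn.Module):
    """Hypothetical fusion of discrete tokens with time-aligned continuous
    audio features (e.g., one spectrogram frame per token) into a single
    input stream for a decoder-only LLM."""
    def __init__(self, vocab_size: int, feat_dim: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(feat_dim, d_model)

    def forward(self, tokens: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # tokens: (b, t) int64 ids; feats: (b, t, feat_dim) float features
        return self.tok(tokens) + self.proj(feats)
```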
arXiv Detail & Related papers (2024-12-16T05:03:48Z)
- Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization [74.3339999119713]
We develop a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies.
Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon.
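That pipeline (scale, decompose, threshold, quantize) can be sketched with PyWavelets; the wavelet choice, threshold, and bin count below are placeholder assumptions:

```python
import numpy as np
import pywt

def wavelet_tokenize(series, wavelet="haar", level=3, thresh=0.1, n_bins=256):
    """Sketch of the described pipeline: scale the series, apply a
    discrete wavelet transform, zero out small coefficients, then
    quantize the remainder into a finite token vocabulary."""
    x = np.asarray(series, dtype=np.float64)
    x = x / (np.abs(x).mean() + 1e-8)                # mean-absolute scaling
    coeffs = pywt.wavedec(x, wavelet, level=level)   # [cA_n, cD_n, ..., cD_1]
    flat = np.concatenate(coeffs)
    flat[np.abs(flat) < thresh] = 0.0                # threshold: sparsify
    edges = np.linspace(flat.min(), flat.max(), n_bins - 1)
    return np.digitize(flat, edges)                  # integer token ids
```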
arXiv Detail & Related papers (2024-12-06T18:22:59Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x across several models and datasets.
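FIRP's mechanism predicts the intermediate (hidden-layer) representations of future tokens in one forward pass; a much-simplified sketch of that flavor, using one extra linear head per drafted position (names and structure are illustrative):

```python
import torch
import torch.nn as nn

class FutureDraftHeads(nn.Module):
    """From the hidden state at the current position, predict pseudo
    hidden states for the next k positions, then decode each with the
    shared LM head so one forward pass drafts several tokens."""
    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(k))

    def forward(self, h_last: torch.Tensor, lm_head: nn.Module) -> torch.Tensor:
        # h_last: (b, d) hidden state of the last generated position
        drafts = [lm_head(head(h_last)).argmax(dim=-1) for head in self.heads]
        return torch.stack(drafts, dim=1)  # (b, k) drafted future token ids
```

Drafted tokens typically still need verification by the full model; the sketch omits that step.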
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Adaptive Large Language Models By Layerwise Attention Shortcuts [46.76681147411957]
Our LLM-like setup allows the final layer to attend, through the attention mechanism, to whichever intermediate layers it deems fit.
We showcase four different datasets spanning acoustic tokens, natural language, and symbolic music, and achieve superior performance for a GPT-like architecture.
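A minimal sketch of such a layerwise shortcut, letting each position's top representation attend over that same position's outputs from every earlier layer (module layout is an assumption):

```python
import torch
import torch.nn as nn

class LayerShortcutAttention(nn.Module):
    """Sketch: at each position, attend over that position's
    representations from every earlier layer, so effective depth is
    chosen adaptively per token."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, layer_outputs: list[torch.Tensor]) -> torch.Tensor:
        # layer_outputs: list of L tensors, each (b, t, d)
        stack = torch.stack(layer_outputs, dim=2)   # (b, t, L, d)
        b, t, L, d = stack.shape
        mem = stack.reshape(b * t, L, d)            # per-position layer memory
        q = mem[:, -1:, :]                          # query with the last layer
        out, _ = self.attn(q, mem, mem)             # (b*t, 1, d)
        return out.reshape(b, t, d)
```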
arXiv Detail & Related papers (2024-09-17T03:46:01Z)
- Towards Signal Processing In Large Language Models [46.76681147411957]
This paper introduces the idea of applying signal processing inside a Large Language Model (LLM).
We draw parallels between classical Fourier transforms and Fourier-transform-like learnable time-frequency representations.
We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance.
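The abstract leaves the exact operator open; one way to realize a Fourier-transform-like learnable time-frequency representation over intermediate embeddings is to project a causal window of each coordinate's recent trajectory onto learned basis vectors, as in this sketch (all names and sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableTimeFreq(nn.Module):
    """Sketch: a learnable Fourier-like transform over each embedding
    coordinate's recent trajectory. A causal window of past values is
    projected onto learned basis vectors (a DFT basis would recover the
    classical transform)."""
    def __init__(self, window: int = 16, n_basis: int = 16):
        super().__init__()
        self.window = window
        self.basis = nn.Parameter(torch.randn(n_basis, window) / window ** 0.5)
        self.mix = nn.Linear(n_basis, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (b, t, d) -> treat each embedding coordinate as a 1-D signal
        b, t, d = x.shape
        sig = x.permute(0, 2, 1).reshape(b * d, 1, t)
        sig = F.pad(sig, (self.window - 1, 0))          # causal padding
        win = sig.unfold(2, self.window, 1)             # (b*d, 1, t, window)
        spec = torch.einsum("bctw,nw->bctn", win, self.basis)
        out = self.mix(spec).squeeze(-1)                # (b*d, 1, t)
        return out.reshape(b, d, t).permute(0, 2, 1)    # back to (b, t, d)
```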
arXiv Detail & Related papers (2024-06-10T13:51:52Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames, which makes full use of temporal information.
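The paper's side-tuning design is not recoverable from this summary; as a point of contrast only, the simplest way to use the temporal information that static-image PAR discards is to pool per-frame features over time before classification:

```python
import torch
import torch.nn as nn

class VideoPARHead(nn.Module):
    """Baseline sketch (not the paper's method): pool per-frame features
    from a frozen image backbone over time, then classify attributes."""
    def __init__(self, feat_dim: int, n_attrs: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_attrs)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (b, n_frames, feat_dim) from a pre-trained backbone
        pooled = frame_feats.mean(dim=1)   # simple temporal pooling
        return self.classifier(pooled)     # per-attribute logits
```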
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision-token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that, with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
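A minimal sketch of the prune-and-merge step, keeping the highest-scoring vision tokens and folding the rest into a single merged token (the scoring itself, supervised by language outputs in ELIP, is assumed given here):

```python
import torch

def prune_and_merge(tokens: torch.Tensor, scores: torch.Tensor, keep: int):
    """Keep the `keep` highest-scoring vision tokens and average the
    discarded ones into one merged token so their information is not
    lost entirely.
    tokens: (b, n, d); scores: (b, n) importance per token."""
    idx = scores.topk(keep, dim=1).indices                       # (b, keep)
    kept = torch.gather(
        tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    mask = torch.ones_like(scores).scatter(1, idx, 0.0)          # 1 = dropped
    merged = (tokens * mask.unsqueeze(-1)).sum(1, keepdim=True)
    merged = merged / mask.sum(1, keepdim=True).clamp(min=1).unsqueeze(-1)
    return torch.cat([kept, merged], dim=1)                      # (b, keep+1, d)
```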
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
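The align-before-fuse objective is a standard image-text contrastive loss applied to unimodal embeddings before any cross-modal fusion; a minimal sketch (temperature and normalization choices are conventional, not necessarily ALBEF's exact values):

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss: matched pairs sit on the
    diagonal of the batch similarity matrix."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (b, b) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```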
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Fully Learnable Deep Wavelet Transform for Unsupervised Monitoring of High-Frequency Time Series [2.7793394375935088]
High-frequency (HF) signals are ubiquitous in the industrial world and are of great use for the monitoring of industrial assets.
Most deep learning tools are designed for inputs of fixed and/or very limited size, so successful applications of deep learning in industrial contexts typically rely on extracted features as inputs rather than raw signals.
We propose a fully unsupervised deep learning framework that extracts meaningful and sparse representations of raw HF signals.
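A sketch of what a fully learnable wavelet-style cascade could look like: pairs of learnable low/high-pass filters applied with stride 2, with an L1 term to encourage sparse coefficients (filter sizes and depth are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LearnableDWT(nn.Module):
    """Sketch: a wavelet-like cascade of learnable low/high-pass filter
    pairs with stride 2, producing multi-scale coefficients from a raw
    high-frequency signal."""
    def __init__(self, kernel_size: int = 8, levels: int = 4):
        super().__init__()
        self.filters = nn.ModuleList(
            nn.Conv1d(1, 2, kernel_size, stride=2, padding=kernel_size // 2)
            for _ in range(levels))

    def forward(self, x: torch.Tensor):
        # x: (b, 1, t) raw signal; returns high-pass coeffs per level
        details, approx = [], x
        for f in self.filters:
            out = f(approx)              # (b, 2, ~t/2)
            approx = out[:, :1]          # low-pass branch, cascaded further
            details.append(out[:, 1:])   # high-pass (detail) coefficients
        return details, approx

def sparsity(details) -> torch.Tensor:
    """L1 penalty on the coefficients for unsupervised training."""
    return sum(d.abs().mean() for d in details)
```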
arXiv Detail & Related papers (2021-05-03T14:35:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated list (including all information) and is not responsible for any consequences arising from its use.