WaveletGPT: Wavelets Meet Large Language Models
- URL: http://arxiv.org/abs/2409.12924v2
- Date: Thu, 3 Oct 2024 09:21:57 GMT
- Title: WaveletGPT: Wavelets Meet Large Language Models
- Authors: Prateek Verma
- Abstract summary: Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements.
This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure.
We achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music.
- Score: 1.2328446298523066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. They are trained on a simple objective: to predict the next token given the previous context. We live in a world where most of the data around us, e.g., text, audio, and music, has a multi-scale structure associated with it. This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure. Without adding any extra parameters to a GPT-style LLM architecture, we achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music. This is achieved by imposing a structure on intermediate embeddings. When trained for the same number of training steps, we achieve significant gains in performance, comparable to pre-training a larger neural architecture. Our architecture allows every next-token prediction access to intermediate embeddings at different temporal resolutions in every Transformer decoder block. This work will hopefully pave the way for incorporating multi-rate signal processing ideas into traditional LLM pre-training. Further, we show that model performance can be pushed by improving internal structure rather than simply scaling up.
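As a rough illustration of the core idea, the sketch below exposes a decoder block's intermediate embeddings at several temporal resolutions using a parameter-free, Haar-style causal moving average. The way the multi-scale views are combined here is an assumption for illustration; the paper's exact construction may differ.

```python
import torch
import torch.nn.functional as F

def causal_multiscale_views(x: torch.Tensor, levels: int = 3):
    """Haar-style approximations of intermediate embeddings at progressively
    coarser temporal resolutions, computed causally so that position t only
    ever sees positions <= t.

    x: (batch, time, dim) hidden states from a Transformer decoder block.
    Returns a list of tensors with the same shape as x, one per scale.
    """
    views = []
    for level in range(1, levels + 1):
        k = 2 ** level                                   # window doubles at each coarser scale
        padded = F.pad(x.transpose(1, 2), (k - 1, 0))    # left-pad only: no future leakage
        smooth = F.avg_pool1d(padded, kernel_size=k, stride=1)
        views.append(smooth.transpose(1, 2))
    return views

def add_wavelet_structure(x: torch.Tensor, levels: int = 3) -> torch.Tensor:
    """Mix the multi-scale views back into the hidden states without adding
    any parameters (plain averaging here is a placeholder choice)."""
    views = causal_multiscale_views(x, levels)
    return (x + sum(views)) / (len(views) + 1)
```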
Related papers
- Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison [27.44915531637358]
We compare the performance of dense feature prepending (DFP) with that of cross-attention architectures.
Despite the wide adoption of DFP, our results do not indicate a clear advantage of DFP over cross-attention.
arXiv Detail & Related papers (2025-01-04T20:14:16Z) - Whisper-GPT: A Hybrid Representation Audio Large Language Model [1.2328446298523066]
We propose a generative large language model (LLM) for speech and music that works with continuous audio representations and discrete tokens simultaneously within a single architecture.
We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.
arXiv Detail & Related papers (2024-12-16T05:03:48Z) - Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization [74.3339999119713]
We develop a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies.
Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon.
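The scale-decompose-threshold-quantize pipeline described above can be sketched as follows; the wavelet family, threshold rule, and quantizer are placeholder choices, not necessarily those used in the paper.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_tokenize(series: np.ndarray, wavelet: str = "db4", level: int = 3,
                     threshold: float = 0.05, n_bins: int = 256) -> np.ndarray:
    """Sketch of a wavelet-based tokenizer for a 1-D time series."""
    # 1) scale the series to zero mean and unit variance
    scaled = (series - series.mean()) / (series.std() + 1e-8)
    # 2) multi-level discrete wavelet decomposition into time-localized frequencies
    coeffs = pywt.wavedec(scaled, wavelet, level=level)
    flat = np.concatenate(coeffs)
    # 3) soft-threshold small coefficients toward zero
    flat = pywt.threshold(flat, threshold * np.abs(flat).max(), mode="soft")
    # 4) uniform quantization of the surviving coefficients into integer tokens
    lo, hi = flat.min(), flat.max()
    tokens = np.round((flat - lo) / (hi - lo + 1e-8) * (n_bins - 1))
    return tokens.astype(np.int64)
```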
arXiv Detail & Related papers (2024-12-06T18:22:59Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
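A heavily simplified sketch of predicting several tokens per decoding step from one hidden state is given below; the projection-then-decode scheme and all names are illustrative assumptions, not FIRP's actual design.

```python
import torch
import torch.nn as nn

class MultiTokenDraftHead(nn.Module):
    """Map the current hidden state to pseudo hidden states for the next k
    positions, then decode each of them with the usual LM head."""
    def __init__(self, dim: int, vocab_size: int, k: int = 3):
        super().__init__()
        self.k = k
        self.project = nn.Linear(dim, dim * k)   # one pseudo hidden state per future token
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, dim) hidden state at the current decoding position
        pseudo = self.project(h).view(h.size(0), self.k, -1)
        return self.lm_head(pseudo)              # (batch, k, vocab_size) draft logits
```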
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
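A minimal sketch of this chunk-and-aggregate pattern, with a stand-in linear head in place of the LLM-based genre scorer; the encoder shape and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ChunkGenreClassifier(nn.Module):
    """Split audio into 20 ms chunks, encode each chunk with a small
    convolutional encoder, score each chunk, and average the predictions."""
    def __init__(self, sample_rate: int = 16000, n_genres: int = 10, dim: int = 64):
        super().__init__()
        self.chunk_len = int(0.020 * sample_rate)  # 20 ms worth of samples
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=9, padding=4), nn.GELU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(dim, n_genres)       # stand-in for the LLM scorer

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, samples); trim so it splits into whole 20 ms chunks
        B, S = audio.shape
        n_chunks = S // self.chunk_len
        chunks = audio[:, : n_chunks * self.chunk_len].reshape(B * n_chunks, 1, self.chunk_len)
        feats = self.encoder(chunks).squeeze(-1)           # (B * n_chunks, dim)
        logits = self.head(feats).reshape(B, n_chunks, -1)
        return logits.mean(dim=1)                          # aggregate over chunks
```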
arXiv Detail & Related papers (2024-10-10T19:17:56Z) - Adaptive Large Language Models By Layerwise Attention Shortcuts [46.76681147411957]
The proposed LLM-like setup allows the final layer to attend to all intermediate layers as it deems fit through the attention mechanism.
We showcase four different datasets spanning acoustic tokens, natural language, and symbolic music, and achieve superior performance for GPT-like architectures.
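One way to realize this shortcut idea is to treat depth as an attention axis: for each position, a late layer attends over the representations every earlier layer produced for that same position. The sketch below is an illustrative simplification, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LayerwiseShortcutAttention(nn.Module):
    """Let the final layer's representation of each position attend over the
    stack of representations produced by all earlier layers at that position."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, time, dim) tensors, one per decoder layer
        stack = torch.stack(layer_outputs, dim=2)   # (B, T, L, D)
        B, T, L, D = stack.shape
        stack = stack.reshape(B * T, L, D)          # one depth-L "sequence" per position
        query = stack[:, -1:, :]                    # the final layer's representation
        mixed, _ = self.attn(query, stack, stack)   # attend across depth
        return mixed.reshape(B, T, D)
```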
arXiv Detail & Related papers (2024-09-17T03:46:01Z) - Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs).
We show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off.
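As a toy example of what a structured feedforward block can look like, the sketch below factorizes both FFN projections through a low-rank bottleneck. Low-rank factorization is only one of several structured alternatives and is not necessarily the parameterization studied in the paper.

```python
import torch.nn as nn

class LowRankFFN(nn.Module):
    """Transformer FFN whose dense projections are each factorized through a
    narrow rank-r bottleneck, cutting parameters and FLOPs."""
    def __init__(self, dim: int, hidden: int, rank: int):
        super().__init__()
        self.up = nn.Sequential(nn.Linear(dim, rank, bias=False),
                                nn.Linear(rank, hidden))
        self.down = nn.Sequential(nn.Linear(hidden, rank, bias=False),
                                  nn.Linear(rank, dim))
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```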
arXiv Detail & Related papers (2024-06-24T08:43:21Z) - Towards Signal Processing In Large Language Models [46.76681147411957]
This paper introduces the idea of applying signal processing inside a Large Language Model (LLM).
We draw parallels between the classical Fourier transform and Fourier-transform-like learnable time-frequency representations.
We show that, for GPT-like architectures, our work achieves faster convergence and significantly improves performance.
arXiv Detail & Related papers (2024-06-10T13:51:52Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames, making full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
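The pruning step itself reduces to keeping the highest-scoring vision tokens, as in the sketch below; how the per-token influence scores are derived from the language outputs is the paper's contribution and is left abstract here.

```python
import torch

def prune_vision_tokens(vision_tokens: torch.Tensor, scores: torch.Tensor,
                        keep_ratio: float = 0.7) -> torch.Tensor:
    """Keep only the highest-scoring vision tokens.

    vision_tokens: (batch, n_tokens, dim); scores: (batch, n_tokens) influence
    estimates (assumed to come from the language side)."""
    B, N, D = vision_tokens.shape
    k = max(1, int(keep_ratio * N))
    idx = scores.topk(k, dim=1).indices                          # (B, k)
    return vision_tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))
```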
arXiv Detail & Related papers (2023-09-28T05:31:07Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to focus effort on efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
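A minimal sketch of that recipe, assuming the frozen language model accepts a sequence of input embeddings; the module names and the way the perceptual token is inserted are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceptualAdapter(nn.Module):
    """Freeze the language model; train only a linear projection of the
    perceptual features plus one learnable prepended token."""
    def __init__(self, lm: nn.Module, feat_dim: int, embed_dim: int):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():
            p.requires_grad = False                               # >99% of weights frozen
        self.proj = nn.Linear(feat_dim, embed_dim)                # the only trained layer
        self.prompt = nn.Parameter(torch.zeros(1, 1, embed_dim))  # one trainable token

    def forward(self, percept_feats: torch.Tensor, text_embeds: torch.Tensor):
        # percept_feats: (B, feat_dim); text_embeds: (B, T, embed_dim)
        percept = self.proj(percept_feats).unsqueeze(1)           # (B, 1, embed_dim)
        prompt = self.prompt.expand(text_embeds.size(0), -1, -1)
        inputs = torch.cat([prompt, percept, text_embeds], dim=1)
        return self.lm(inputs)       # assumed: frozen LM consumes the embedding sequence
```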
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones [0.0]
We propose an off-the-shelf baseline for Continual Learning on Computer Vision problems.
We exploit the power of pretrained models to compute a class prototype and fill a memory bank.
We compare our pipeline with common CNN models and show the superiority of Vision Transformers.
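The prototype-and-memory-bank baseline can be written in a few lines, assuming a frozen pretrained backbone that maps images to feature vectors.

```python
import torch

class PrototypeMemory:
    """Store one mean feature vector (prototype) per class and classify new
    samples by nearest prototype."""
    def __init__(self):
        self.prototypes = {}                          # class id -> mean feature vector

    def add_class(self, class_id: int, features: torch.Tensor):
        # features: (n_samples, dim) embeddings of one class from the frozen backbone
        self.prototypes[class_id] = features.mean(dim=0)

    def classify(self, feature: torch.Tensor) -> int:
        # feature: (dim,) embedding of a test sample
        distances = {c: torch.norm(feature - p).item() for c, p in self.prototypes.items()}
        return min(distances, key=distances.get)
```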
arXiv Detail & Related papers (2022-05-03T16:03:46Z) - Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or .... [4.594159253008448]
This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures.
Our approach does not have any convolutions, recurrence, attention, transformers or other approaches such as BERT.
A classification head (a feed-forward layer), similar to the approach in SimCLR, is trained on a learned representation.
arXiv Detail & Related papers (2021-10-07T05:00:26Z) - Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
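The momentum-distillation part can be sketched as follows: a slowly updated copy of the model produces soft pseudo-targets, and the loss mixes the hard labels with those targets. The coefficients and the exact targets used in ALBEF may differ.

```python
import torch
import torch.nn.functional as F

def momentum_update(model, momentum_model, m: float = 0.995):
    """Exponential-moving-average update of the momentum (teacher) copy."""
    with torch.no_grad():
        for p, mp in zip(model.parameters(), momentum_model.parameters()):
            mp.data.mul_(m).add_(p.data, alpha=1 - m)

def distillation_loss(logits, momentum_logits, labels, alpha: float = 0.4):
    """Mix the usual cross-entropy with a soft target from the momentum model."""
    hard = F.cross_entropy(logits, labels)
    soft_targets = F.softmax(momentum_logits.detach(), dim=-1)
    soft = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return (1 - alpha) * hard + alpha * soft
```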
arXiv Detail & Related papers (2021-07-16T00:19:22Z) - Fully Learnable Deep Wavelet Transform for Unsupervised Monitoring of High-Frequency Time Series [2.7793394375935088]
High-Frequency (HF) signals are ubiquitous in the industrial world and are of great use for the monitoring of industrial assets.
Most deep learning tools are designed for inputs of fixed and/or very limited size, and successful applications of deep learning in industrial contexts use extracted features as inputs.
We propose a fully unsupervised deep learning framework that is able to extract meaningful and sparse representation of raw HF signals.
arXiv Detail & Related papers (2021-05-03T14:35:06Z) - Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)