SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts
- URL: http://arxiv.org/abs/2502.18394v7
- Date: Sun, 18 May 2025 03:12:25 GMT
- Title: SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts
- Authors: Jacob Fein-Ashley, Neelesh Gupta, Rajgopal Kannan, Viktor Prasanna
- Abstract summary: Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module.
- Score: 2.200751835496112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications, from multi-turn dialogue to high-resolution vision, require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^{2})$ to $\mathcal{O}(L\log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to 7$\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements while adding fewer than 6\% additional parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.
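As a rough sketch of the mechanism the abstract describes (a real FFT along the sequence, a content-adaptive spectral gate, then an inverse FFT), the PyTorch module below shows one way such a token mixer can be wired up. It is not the authors' implementation: the class name, the pooled-summary gating network, and all shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpectralTokenMixer(nn.Module):
    """Illustrative FFT token mixer: rFFT over the sequence, a learned
    content-adaptive gate on the spectrum, then an inverse rFFT."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.n_freq = max_len // 2 + 1  # rFFT length for the longest sequence
        # The gate is conditioned on a pooled summary of the input tokens,
        # so the spectral filter adapts to the content of each example.
        self.gate_mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, self.n_freq),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = x.shape[1]
        spec = torch.fft.rfft(x, dim=1)                  # (B, L//2+1, D), complex
        summary = x.mean(dim=1)                          # (B, D) content summary
        gate = torch.sigmoid(self.gate_mlp(summary))     # (B, n_freq)
        gate = gate[:, : spec.shape[1]].unsqueeze(-1)    # crop to this length
        return torch.fft.irfft(spec * gate, n=seq_len, dim=1)
```

In a layer of this shape, the two transforms cost $\mathcal{O}(L\log L)$ and the gating is linear in $L$, which is where the claimed advantage over the $\mathcal{O}(L^{2})$ attention score matrix would come from.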
Related papers
- Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache [67.47789629197857]
We propose a training-free framework that exploits the heterogeneous roles of transformer head dimensions. By projecting the long-context-insensitive dimensions onto Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. We show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack.
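A minimal sketch of the compression idea described above, assuming a cache slice of shape (heads, seq_len, dim): keep only a fixed number of rFFT coefficients along the time axis and reconstruct an approximate trajectory on demand. The function names and shapes are illustrative, not FourierAttention's API.

```python
import torch

def spectral_compress(cache: torch.Tensor, m: int) -> torch.Tensor:
    """Keep the first m rFFT coefficients of each dimension's time series.
    cache: (heads, seq_len, dim) -> coeffs: (heads, m, dim), complex."""
    return torch.fft.rfft(cache, dim=1)[:, :m, :]

def spectral_decompress(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Reconstruct an approximate (heads, seq_len, dim) trajectory; the
    missing high frequencies are implicitly zero-padded by irfft."""
    return torch.fft.irfft(coeffs, n=seq_len, dim=1)
```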
arXiv Detail & Related papers (2025-06-13T15:35:54Z) - Block Circulant Adapter for Large Language Models [10.353352027807272]
Fine-tuning large language models (LLMs) is difficult due to their huge model size.
Recent Fourier domain-based methods show potential for reducing fine-tuning costs.
We propose a block-circulant matrix fine-tuning method with stable training that leverages the properties of circulant matrices.
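The snippet below is a small illustration of why circulant structure is attractive for adapters: a circulant matrix is fully determined by its first column, and multiplying by it reduces to an FFT-domain elementwise product. This shows only the basic primitive, not the paper's fine-tuning recipe.

```python
import torch

def circulant_matvec(first_col: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Multiply the circulant matrix defined by `first_col` with vector `x`
    in O(n log n) via the FFT (circular convolution theorem)."""
    return torch.fft.ifft(torch.fft.fft(first_col) * torch.fft.fft(x)).real

# Check against the dense n x n circulant matrix (O(n^2) storage and compute).
n = 8
c, x = torch.randn(n), torch.randn(n)
dense = torch.stack([torch.roll(c, shifts=i) for i in range(n)], dim=1)
assert torch.allclose(dense @ x, circulant_matvec(c, x), atol=1e-5)
```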
arXiv Detail & Related papers (2025-05-01T15:14:32Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
This paper first attributes the inefficiency of Transformers to the attention-sink phenomenon. We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training: softmax is replaced with the sigmoid function, and balanced ALiBi and Rotary Position Embedding are used for efficient information compression and retention.
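For reference, a generic sliding-window attention mask (the basic construct the training scheme above builds on) can be formed as follows; this is a textbook illustration, not SWAT's implementation.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where a query may attend,
    i.e. only to itself and the previous `window - 1` positions."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (k <= q) & (k > q - window)
```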
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers [18.469378618426294]
We introduce Hamming Attention Distillation (HAD), a framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention.
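A toy version of scoring with binarized keys and queries might look like the sketch below: sign-binarize both, count differing bits, and use the negated Hamming distance as the attention logit. The distillation procedure and hardware mapping from the paper are not shown, and all names are illustrative.

```python
import torch

def hamming_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q: (..., Lq, d), k: (..., Lk, d) -> (..., Lq, Lk) attention logits
    based on the Hamming distance between sign-binarized vectors."""
    qb = q > 0
    kb = k > 0
    # Count positions where the binary codes disagree.
    dist = (qb.unsqueeze(-2) ^ kb.unsqueeze(-3)).sum(dim=-1)
    return -dist.float()   # smaller distance -> larger score
```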
arXiv Detail & Related papers (2025-02-03T19:24:01Z) - FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach to enhance flatness of weights and activations.
Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective.
For inference latency, FlatQuant reduces the slowdown induced by the pre-quantization transformation from $0.26\times$ (QuaRot) to merely $0.07\times$, bringing up to a $2.3\times$ speedup for prefill and a $1.7\times$ speedup for decoding.
arXiv Detail & Related papers (2024-10-12T08:10:28Z) - Neural Fourier Modelling: A Highly Compact Approach to Time-Series Analysis [9.969451740838418]
We introduce Neural Fourier Modelling (NFM), a compact yet powerful solution for time-series analysis.
NFM is grounded in two key properties of the Fourier transform (FT): (i) the ability to model finite-length time series as functions in the Fourier domain, and (ii) the capacity for data manipulation within the Fourier domain.
NFM achieves state-of-the-art performance on a wide range of tasks, including challenging time-series scenarios with previously unseen sampling rates at test time.
arXiv Detail & Related papers (2024-10-07T02:39:55Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
Diffusion Gated Linear Attention Transformers (DiG) is a simple, adoptable solution with minimal parameter overhead.
We offer two variants, i.e., a plain and a U-shaped architecture, showing superior efficiency and competitive effectiveness.
arXiv Detail & Related papers (2024-05-28T17:59:33Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing. These huge models are memory hungry and incur significant inference latency even on cutting-edge AI accelerators. We propose LeanAttention, a scalable technique for computing self-attention in the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - Parameter-Efficient Fine-Tuning with Discrete Fourier Transform [26.563344030824414]
Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models.
We introduce FourierFT, which treats $\Delta W$ as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients.
Our method shows comparable or better performance with fewer parameters than LoRA on various tasks.
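A rough sketch, under stated assumptions, of the parameterization described above: hold a small set of trainable spectral coefficients at fixed (here randomly chosen) frequency locations and recover the dense update $\Delta W$ with an inverse 2-D FFT. This follows the general idea only; the entry locations, scaling, and module name are placeholders rather than FourierFT's actual code.

```python
import torch
import torch.nn as nn

class SpectralDeltaW(nn.Module):
    """Weight update Delta W parameterized by n_coeff Fourier coefficients."""

    def __init__(self, d_out: int, d_in: int, n_coeff: int = 1000):
        super().__init__()
        self.shape = (d_out, d_in)
        # Fixed random frequency locations; only their values are trained.
        idx = torch.randperm(d_out * d_in)[:n_coeff]
        self.register_buffer("rows", idx // d_in)
        self.register_buffer("cols", idx % d_in)
        self.coeff = nn.Parameter(torch.zeros(n_coeff))

    def delta_w(self) -> torch.Tensor:
        spectrum = torch.zeros(self.shape, dtype=torch.cfloat,
                               device=self.coeff.device)
        spectrum[self.rows, self.cols] = self.coeff.to(torch.cfloat)
        # The real part of the inverse 2-D FFT serves as the dense update.
        return torch.fft.ifft2(spectrum).real
```

Only `n_coeff` scalars are trained in this sketch, compared with the `d_out * d_in` entries of a dense update.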
arXiv Detail & Related papers (2024-05-05T17:15:24Z) - ATFNet: Adaptive Time-Frequency Ensembled Network for Long-term Time Series Forecasting [7.694820760102176]
ATFNet is an innovative framework that combines a time domain module and a frequency domain module.
We introduce Dominant Harmonic Series Energy Weighting, a novel mechanism for adjusting the weights between the two modules.
Our Complex-valued Spectrum Attention mechanism offers a novel approach to discern the intricate relationships between different frequency combinations.
arXiv Detail & Related papers (2024-04-08T04:41:39Z) - Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation [31.657608562937543]
We introduce GRIFFIN, a training-free and calibration-free method that selects unique feed-forward (FF) experts at the sequence level for efficient generation.
GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks.
arXiv Detail & Related papers (2024-04-01T17:56:06Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance [87.19164603145056]
We propose LoRAT, a method that unveils the power of large ViT models for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head, solely based on MLP, to adapt PETR, enabling better performance with less computational overhead.
arXiv Detail & Related papers (2024-03-08T11:41:48Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1\% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores [18.016204763652553]
Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks.
The Fast Fourier Transform (FFT) allows long convolutions to run in $O(N \log N)$ time in sequence length $N$ but has poor hardware utilization.
In this paper, we study how to optimize the FFT convolution.
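The baseline primitive the paper optimizes is the FFT convolution itself; a plain, hardware-agnostic version is sketched below. The argument shapes are assumptions for the example, and none of the paper's tensor-core kernel work is reflected here.

```python
import torch

def fft_long_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal long convolution y[t] = sum_{s<=t} k[s] * u[t-s] in O(N log N).
    u: (batch, channels, N) inputs, k: (channels, N) filters."""
    N = u.shape[-1]
    n_fft = 2 * N  # zero-pad to avoid circular wrap-around
    y = torch.fft.irfft(torch.fft.rfft(u, n=n_fft) * torch.fft.rfft(k, n=n_fft),
                        n=n_fft)
    return y[..., :N]  # keep the causal part
```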
arXiv Detail & Related papers (2023-11-10T07:33:35Z) - p-Laplacian Transformer [7.2541371193810384]
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT).
arXiv Detail & Related papers (2023-11-06T16:25:56Z) - WFTNet: Exploiting Global and Local Periodicity in Long-term Time Series Forecasting [61.64303388738395]
We propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting.
Tests on various time series datasets show WFTNet consistently outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-20T13:44:18Z) - Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF [6.135925201075925]
We propose the dynamic PlenOctree (DOT), which adaptively refines the sample distribution to adjust to changing scene complexity.
Compared with POT, DOT improves visual quality, reduces parameters by over $55.15\%$/$68.84\%$, and provides $1.7\times$/$1.9\times$ the FPS on NeRF-synthetic and Tanks & Temples, respectively.
arXiv Detail & Related papers (2023-07-28T06:21:42Z) - Adaptive Frequency Filters As Efficient Global Token Mixers [100.27957692579892]
We show that adaptive frequency filters can serve as efficient global token mixers.
We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet.
arXiv Detail & Related papers (2023-07-26T07:42:28Z) - Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models [96.76758318732308]
We show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation.
Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the WFLW facial landmark dataset.
arXiv Detail & Related papers (2023-04-02T19:08:02Z) - Dynamic Temporal Filtering in Video Models [128.02725199486719]
We present a new recipe for temporal feature learning, namely the Dynamic Temporal Filter (DTF).
DTF learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics.
It is feasible to plug the DTF block into ConvNets and Transformers, yielding DTF-Net and DTF-Transformer.
arXiv Detail & Related papers (2022-11-15T15:59:28Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using $\widetilde{O}(\cdot)$ samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Long-term Leap Attention, Short-term Periodic Shift for Video Classification [41.87505528859225]
Video transformers naturally incur a heavier computation burden than static vision transformers.
We propose LAPS, a long-term "Leap Attention" (LAN) and short-term "Periodic Shift" (P-Shift) module for video transformers.
arXiv Detail & Related papers (2022-07-12T13:30:15Z) - Deep Frequency Filtering for Domain Generalization [55.66498461438285]
Deep Neural Networks (DNNs) have preferences for some frequency components in the learning process.
We propose Deep Frequency Filtering (DFF) for learning domain-generalizable features.
We show that applying our proposed DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks.
arXiv Detail & Related papers (2022-03-23T05:19:06Z) - Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis.
We show that it improves the sample efficiency of both state-based and image-based RL.
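A minimal sketch of embedding inputs into a learned Fourier basis, assuming a flat state vector as input; the initialization scale and dimensions are illustrative choices, not the paper's configuration.

```python
import math
import torch
import torch.nn as nn

class LearnedFourierFeatures(nn.Module):
    """Map x to [sin(2*pi*x B), cos(2*pi*x B)] with a trainable frequency
    matrix B, producing a (batch, 2 * n_features) embedding."""

    def __init__(self, in_dim: int, n_features: int, scale: float = 1.0):
        super().__init__()
        self.B = nn.Parameter(scale * torch.randn(in_dim, n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = 2 * math.pi * x @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
```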
arXiv Detail & Related papers (2021-12-06T18:59:52Z)