Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
- URL: http://arxiv.org/abs/2412.17739v2
- Date: Thu, 02 Jan 2025 08:58:38 GMT
- Title: Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
- Authors: Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xuekai Zhu, Bowen Zhou,
- Abstract summary: We show that Rotary Position Embedding (RoPE) enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform.
This periodicity is undermined by spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components introduced by time-domain truncation.
We propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization.
- Score: 23.936687072300053
- Abstract: Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components introduced by time-domain truncation. Building on these observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeros out the destructive frequency components, increasing model robustness against spectral damage. Experiments across various model scales show that, within varying context windows, FoPE maintains a more stable perplexity and more consistent accuracy on a needle-in-a-haystack task than RoPE and ALiBi. Several analyses and ablations provide further support for our method and theoretical modeling.
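To make the abstract's description more concrete, the snippet below is a minimal sketch based only on the abstract, not the paper's actual implementation: it keeps RoPE's per-dimension frequencies, adds a few extra Fourier-series components with (here random, in the paper presumably learned) coefficients, and zeroes out frequencies whose period exceeds an assumed training length, so those dimensions become position-independent and thus robust when extrapolating. The function names, cutoff rule, and coefficient handling are assumptions for illustration.
```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angular frequencies theta_j = base^(-2j/d)."""
    j = np.arange(head_dim // 2)
    return base ** (-2.0 * j / head_dim)

def fope_components(positions, head_dim, train_len=512, n_extra=4,
                    base=10000.0, rng=None):
    """Toy FoPE-style cosine/sine factors (assumed form, illustration only).

    Each dimension is treated as a Fourier series: a dominant RoPE frequency
    plus a few extra components. Frequencies that never complete a full period
    within `train_len` (hence under-trained) are clipped to zero frequency.
    """
    rng = rng or np.random.default_rng(0)
    freqs = rope_frequencies(head_dim, base)
    cutoff = 2.0 * np.pi / train_len            # lowest "fully seen" frequency (assumed rule)
    freqs = np.where(freqs >= cutoff, freqs, 0.0)  # zeroed dims become position-independent

    angles = positions[:, None] * freqs[None, :]   # dominant term, as in plain RoPE
    cos, sin = np.cos(angles), np.sin(angles)

    # Small extra Fourier components per dimension (coefficients would be learned).
    extra = rng.choice(freqs[freqs > 0], size=(head_dim // 2, n_extra))
    a = 0.01 * rng.standard_normal((head_dim // 2, n_extra))
    phase = np.einsum("s,df->sdf", positions, extra)
    cos = cos + (a * np.cos(phase)).sum(-1)
    sin = sin + (a * np.sin(phase)).sum(-1)
    return cos, sin                                # plug into the usual RoPE rotation

# Example: factors for a 64-dim head evaluated well beyond the assumed training length.
cos, sin = fope_components(np.arange(2048, dtype=np.float64), head_dim=64, train_len=512)
```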
Related papers
- VideoRoPE: What Makes for Good Video Rotary Position Embedding? [109.88966080843608]
VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination.
VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing.
arXiv Detail & Related papers (2025-02-07T18:56:04Z) - HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation [19.42279057349193]
Positional encodings (PEs) are typically designed to exhibit long-term decay, based on an entrenched and long-standing inductive assumption.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
arXiv Detail & Related papers (2024-10-28T17:01:52Z) - Few-shot NeRF by Adaptive Rendering Loss Regularization [78.50710219013301]
Novel view synthesis with sparse inputs poses great challenges to Neural Radiance Fields (NeRF).
Recent works demonstrate that frequency regularization of Positional Encoding can achieve promising results for few-shot NeRF.
We propose Adaptive Rendering loss regularization for few-shot NeRF, dubbed AR-NeRF.
arXiv Detail & Related papers (2024-10-23T13:05:26Z) - Round and Round We Go! What makes Rotary Positional Encodings useful? [15.543752938828831]
We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level.
We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies.
We propose a modification of RoPE that fixes some highlighted issues and improves performance.
arXiv Detail & Related papers (2024-10-08T17:07:01Z) - Frequency and Generalisation of Periodic Activation Functions in Reinforcement Learning [9.6812227037557]
We show that periodic activations learn low frequency representations and as a result avoid overfitting to bootstrapped targets.
We also show that weight decay regularization is able to partially offset the overfitting of periodic activation functions.
arXiv Detail & Related papers (2024-07-09T11:07:41Z) - Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective [35.947737679664016]
This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
arXiv Detail & Related papers (2024-06-19T07:23:33Z) - Resonance RoPE: Improving Context Length Generalization of Large Language Models [37.749813693281254]
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE).
We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios.
We present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios.
arXiv Detail & Related papers (2024-02-29T19:02:03Z) - Frequency-domain MLPs are More Effective Learners in Time Series Forecasting [67.60443290781988]
Time series forecasting plays a key role in many industrial domains, including finance, traffic, energy, and healthcare.
Most MLP-based forecasting methods suffer from point-wise mappings and an information bottleneck.
We propose FreTS, a simple yet effective architecture built upon frequency-domain MLPs for time series forecasting.
arXiv Detail & Related papers (2023-11-10T17:05:13Z) - Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose the Scaling Laws of RoPE-based Extrapolation to describe the relationship between extrapolation performance and the base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z) - DEPTS: Deep Expansion Learning for Periodic Time Series Forecasting [83.60876685008225]
We introduce a deep expansion learning framework, DEPTS, for PTS forecasting.
DEPTS starts with a decoupled formulation by introducing the periodic state as a hidden variable.
Our two customized modules also have certain interpretable capabilities, such as attributing the forecasts to either local momenta or global periodicity.
arXiv Detail & Related papers (2022-03-15T06:51:58Z) - Deep Autoregressive Models with Spectral Attention [74.08846528440024]
We propose a forecasting architecture that combines deep autoregressive models with a Spectral Attention (SA) module.
By characterizing in the spectral domain the embedding of the time series as occurrences of a random process, our method can identify global trends and seasonality patterns.
Two spectral attention models, global and local to the time series, integrate this information within the forecast and perform spectral filtering to remove noise from the time series.
arXiv Detail & Related papers (2021-07-13T11:08:47Z)