Wavelet-based Positional Representation for Long Context
- URL: http://arxiv.org/abs/2502.02004v1
- Date: Tue, 04 Feb 2025 04:44:53 GMT
- Title: Wavelet-based Positional Representation for Long Context
- Authors: Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito
- Abstract summary: We analyze conventional position encoding methods for long contexts.
We propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms.
Experimental results show that this new method improves the performance of the model in both short and long contexts.
- Score: 14.902305283428642
- Abstract: In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model's attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field.
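The abstract's first observation is that RoPE rotates each pair of embedding dimensions by an angle proportional to the token position, with one fixed frequency per pair, so inner products between rotated queries and keys depend only on the relative position. A minimal NumPy sketch of this standard RoPE computation (the function name and test vectors are illustrative, not from the paper):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply Rotary Position Embedding to a vector x at position `pos`.

    Each dimension pair (2i, 2i+1) is rotated by angle pos * theta_i,
    where theta_i = base**(-2i/d) is a single fixed scale per pair --
    the restricted, fixed-scale behaviour the abstract likens to a
    Haar-like wavelet transform with only one window size.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "embedding dimension must be even"
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)          # one fixed frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin         # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: the score <R(m)q, R(n)k> depends
# only on the offset m - n, not on the absolute positions.
q = np.array([1.0, 0.0, 0.5, -0.5])
k = np.array([0.2, 1.0, -0.3, 0.7])
s1 = float(rope_rotate(q, 5) @ rope_rotate(k, 3))
s2 = float(rope_rotate(q, 12) @ rope_rotate(k, 10))  # same offset of 2
```

Because each pair keeps a single fixed theta_i for all positions, there is no mechanism for varying the analysis window with position, which is the limitation the proposed multi-scale wavelet representation targets.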
Related papers
- Utilizing Image Transforms and Diffusion Models for Generative Modeling of Short and Long Time Series [7.201938834736084]
We propose a unified generative model for varying-length time series.
We employ invertible transforms such as the delay embedding and the short-time Fourier transform.
We show that our approach achieves consistently state-of-the-art results against strong baselines.
arXiv Detail & Related papers (2024-10-25T13:06:18Z) - Boundary-Recovering Network for Temporal Action Detection [20.517156879086535]
Large temporal scale variation of actions is one of the primary difficulties in temporal action detection (TAD).
We propose Boundary-Recovering Network (BRN) to address the vanishing boundary problem.
BRN constructs scale-time features by introducing a new axis called scale dimension by interpolating multi-scale features to the same temporal length.
arXiv Detail & Related papers (2024-08-18T04:34:49Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Dynamically Modulating Visual Place Recognition Sequence Length For Minimum Acceptable Performance Scenarios [17.183024395686505]
Single image visual place recognition (VPR) provides an alternative for localization but often requires techniques such as sequence matching to improve robustness.
We present an approach which uses a calibration set of data to fit a model that modulates sequence length for VPR as needed to exceed a target localization performance.
arXiv Detail & Related papers (2024-07-01T00:16:35Z) - Mitigate Position Bias in Large Language Models via Scaling a Single Dimension [47.792435921037274]
This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias.
It further identifies that, in addition to position embeddings, causal attention mask also contributes to position bias by creating position-specific hidden states.
Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states.
arXiv Detail & Related papers (2024-06-04T17:55:38Z) - A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z) - Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis [72.71398034617607]
We dissect a relative positional embedding design, ALiBi, via the lens of receptive field analysis.
We modify the vanilla Sinusoidal positional embedding to create the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence.
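The main abstract characterizes ALiBi, the design dissected here, as behaving like windowed attention with varying window sizes: each head subtracts a linear penalty proportional to token distance, and heads with steeper slopes effectively attend within smaller windows. A minimal NumPy sketch of ALiBi's standard bias construction (the function name is illustrative; the slope schedule follows the original ALiBi paper):

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Build ALiBi's per-head linear attention biases.

    Head h gets slope 2**(-8*h/num_heads); the bias added to the
    attention score for query i attending to key j (j <= i) is
    -slope * (i - j). Steep slopes decay scores quickly with distance,
    acting like a small attention window; shallow slopes act like a
    wide one -- the "windows of varying sizes" behaviour.
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = np.maximum(i - j, 0)             # causal distance; future = 0
    return -slopes[:, None, None] * dist    # shape (num_heads, L, L)

# The penalty grows linearly with distance, so each head's effective
# receptive field is restricted even though no hard window is imposed.
bias = alibi_bias(seq_len=5, num_heads=2)
```

This restricted effective receptive field is exactly the limitation the main paper's wavelet-based method aims to avoid while still capturing multiple scales.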
arXiv Detail & Related papers (2022-12-20T15:40:17Z) - TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with a small and non-overlapping receptive field (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z) - FFD: Fast Feature Detector [22.51804239092462]
We show that robust and accurate keypoints exist in the specific scale-space domain.
It is proved that setting the scale-space pyramid's smoothness ratio and blurring to 2 and 0.627, respectively, facilitates the detection of reliable keypoints.
arXiv Detail & Related papers (2020-12-01T21:56:35Z) - NiLBS: Neural Inverse Linear Blend Skinning [59.22647012489496]
We introduce a method to invert the deformations produced by traditional skinning techniques using a neural network parameterized by pose.
The ability to invert these deformations allows values (e.g., distance function, signed distance function, occupancy) to be pre-computed at rest pose, and then efficiently queried when the character is deformed.
arXiv Detail & Related papers (2020-04-06T20:46:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.