On the token distance modeling ability of higher RoPE attention dimension
- URL: http://arxiv.org/abs/2410.08703v2
- Date: Mon, 21 Oct 2024 08:49:18 GMT
- Title: On the token distance modeling ability of higher RoPE attention dimension
- Authors: Xiangyu Hong, Che Jiang, Biqing Qi, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou
- Abstract summary: We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention head, which we name Positional Heads, across various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
- Score: 76.55792402912027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Length extrapolation algorithms based on Rotary Position Embedding (RoPE) have shown promising results in extending the context length of language models. However, how position embeddings capture longer-range contextual information remains elusive. Based on the intuition that different dimensions correspond to different frequencies of change in the RoPE encoding, we conducted a dimension-level analysis to investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies. Using our correlation metric, we identified a particular type of attention head, which we name Positional Heads, in various length-extrapolated models. These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing, as evidenced by our ablation study. We further demonstrate the correlation between the efficiency of length extrapolation and the extension of the high-dimensional attention allocation of these heads. The identification of Positional Heads provides insights for future research in long-text comprehension.
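To make the dimension-frequency intuition concrete, here is a minimal sketch of the standard RoPE frequency schedule; the head dimension and base below are common defaults chosen for illustration, not values reported in the paper.

```python
import numpy as np

# Minimal sketch of the intuition behind the paper's dimension-level
# analysis: each RoPE dimension pair rotates at its own frequency, so
# low-frequency (high-index) pairs change slowly with position and can
# track long-range token distances. head_dim and base are assumptions.
head_dim = 128          # per-head dimension (assumed)
base = 10000.0          # standard RoPE base

pair_idx = np.arange(head_dim // 2)
theta = base ** (-2.0 * pair_idx / head_dim)   # rotation angle per token step
wavelength = 2.0 * np.pi / theta               # tokens per full rotation

for i in [0, 16, 32, 48, 63]:
    print(f"pair {i:2d}: theta={theta[i]:.2e}, wavelength={wavelength[i]:,.0f} tokens")

# Pairs whose wavelength exceeds the training context never complete a
# full rotation, which is why the higher dimensions are the natural
# place to look for long-distance dependency modeling.
```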
Related papers
- HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation [19.42279057349193]
Many positional encodings (PEs) are designed to exhibit long-term decay, following a long-standing inductive bias.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
arXiv Detail & Related papers (2024-10-28T17:01:52Z)
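For context, the long-term decay that HoPE argues against can be observed numerically; the sketch below assumes identical query/key content and standard RoPE parameters, and is an illustration rather than code from either paper.

```python
import numpy as np

# A tiny demonstration of the "long-term decay" property the HoPE entry
# refers to: with RoPE, the attention logit between a fixed query and
# key decays (on average) as their distance grows. Assumed setup:
# identical query/key content (all-ones), head_dim=128, base=10000.
head_dim, base = 128, 10000.0
theta = base ** (-2.0 * np.arange(head_dim // 2) / head_dim)

# For q = k = ones, the RoPE logit at relative distance Delta reduces to
# sum_i cos(Delta * theta_i): maximal at Delta=0, decaying toward zero.
for delta in [0, 1, 8, 64, 512, 4096]:
    logit = np.cos(delta * theta).sum()
    print(f"distance {delta:5d}: relative logit {logit / (head_dim // 2):+.3f}")
```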
- Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement [62.87020831987625]
We propose a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations.
We select the most challenging samples as the influential data to effectively frame the long-range dependencies.
Experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
arXiv Detail & Related papers (2024-10-21T04:30:53Z)
- Caformer: Rethinking Time Series Analysis from Causal Perspective [7.354128514581098]
We introduce a novel framework called Caformer for time series analysis from a causal perspective.
Our framework comprises three components: Dynamic Learner, Environment Learner, and Dependency Learner.
Our Caformer demonstrates consistent state-of-the-art performance across five mainstream time series analysis tasks.
arXiv Detail & Related papers (2024-03-13T14:28:02Z)
- Unveiling the Secrets of Engaging Conversations: Factors that Keep Users Hooked on Role-Playing Dialog Agents [17.791787477586574]
The degree to which the bot embodies the roles it plays has limited influence on retention rates, while the length of each turn it speaks affects them significantly.
This study sheds light on the critical aspects of user engagement with role-playing models and provides valuable insights for future improvements in the development of large language models for role-playing purposes.
arXiv Detail & Related papers (2024-02-18T09:42:41Z)
- Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z)
- KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation [72.71398034617607]
KERPLE is a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences.
The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way.
arXiv Detail & Related papers (2022-05-20T01:25:57Z)
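As a rough illustration of the kernelization idea, the sketch below implements a bias of the form used by KERPLE's logarithmic variant, -r1 * log(1 + r2 * |m - n|); the parameter values here are illustrative stand-ins for the per-head learnable parameters.

```python
import numpy as np

# Minimal sketch of the KERPLE idea: a conditionally positive definite
# (CPD) kernel of the positional difference |m - n| is added as a bias
# to the attention logits. The logarithmic variant follows the form
# -r1 * log(1 + r2 * |m - n|); r1, r2 are learnable per head, and the
# concrete values below are assumptions for illustration.
def kerple_log_bias(seq_len: int, r1: float = 1.0, r2: float = 1.0) -> np.ndarray:
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])      # |m - n|
    return -r1 * np.log1p(r2 * dist)                # CPD kernel bias

bias = kerple_log_bias(8)
# The bias decays smoothly and is defined for any distance, which is
# what lets the model extrapolate to lengths unseen in training.
print(bias[0])   # bias from position 0 to positions 0..7
```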
- Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z)
- Continual Learning of Long Topic Sequences in Neural Information Retrieval [2.3846478553599098]
We first propose a dataset based upon the MSMarco corpus aiming at modeling a long stream of topics.
We then analyze in depth the ability of recent neural IR models to continually learn over those streams.
arXiv Detail & Related papers (2022-01-10T14:19:09Z)
- Enriched Attention for Robust Relation Extraction [10.925904231385207]
Relation extraction models do not scale well to long sentences with multiple entities and relations.
Attention allows the model to focus on parts of the input sentence that are relevant to relation extraction.
Our model outperforms prior work using comparable setups on two popular benchmarks.
arXiv Detail & Related papers (2021-04-22T07:17:19Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose setting the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably greater expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
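The low-rank bottleneck described in the last entry is easy to verify numerically; the sketch below uses assumed sizes and shows that the per-head attention logit matrix cannot exceed rank d_h = d_model / num_heads.

```python
import numpy as np

# Quick numerical check of the low-rank bottleneck: per-head attention
# logits form an n x n matrix Q @ K.T whose rank is at most the head
# dimension d_h = d_model // num_heads. Sizes are assumptions.
n, d_model, num_heads = 256, 512, 16
d_h = d_model // num_heads                 # 32, far smaller than n

rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d_h))
K = rng.normal(size=(n, d_h))
logits = Q @ K.T                           # n x n, but rank <= d_h

print(np.linalg.matrix_rank(logits))       # prints 32, not 256
# The paper's proposal sets d_h to the sequence length, independent of
# the number of heads, so the logit matrix can be full rank.
```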
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.