Length-Induced Embedding Collapse in Transformer-based Models
- URL: http://arxiv.org/abs/2410.24200v1
- Date: Thu, 31 Oct 2024 17:55:36 GMT
- Title: Length-Induced Embedding Collapse in Transformer-based Models
- Authors: Yuqi Zhou, Sunhao Dai, Zhanshuo Cao, Xiao Zhang, Jun Xu
- Abstract summary: We find that performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space.
This collapse results in a distributional inconsistency between embeddings of different text lengths, hurting the performance of downstream tasks.
We propose to mitigate the undesirable length collapse limitation by introducing a temperature in softmax(), which achieves a higher low-pass filter attenuation rate.
- Score: 7.127156731612495
- License:
- Abstract: Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering that the self-attention mechanism inherently functions as a low-pass filter, we prove that long sequences increase the attenuation rate of this low-pass filtering effect. As layers go deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps collapse into a narrow space, especially for long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse limitation by introducing a temperature in softmax(), which achieves a higher low-pass filter attenuation rate. The tuning-free method, called TempScale, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale improves existing embedding models, especially on long text inputs, bringing up to 0.53% performance gains on 40 datasets from the Massive Text Embedding Benchmark (MTEB) and 0.82% performance gains on 4 datasets from LongEmbed, which specifically focuses on long-context retrieval.
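As an illustration of the idea rather than the authors' exact implementation, the sketch below adds a softmax temperature to standard scaled dot-product attention; the function name `temperature_scaled_attention` and the value of `tau` are assumptions made for demonstration only.

```python
import torch
import torch.nn.functional as F

def temperature_scaled_attention(q, k, v, tau=1.0):
    """Scaled dot-product attention with an extra softmax temperature.

    q, k, v: (batch, heads, seq_len, head_dim) tensors.
    tau:     softmax temperature; tau != 1 changes how sharply the attention
             distribution concentrates, and thereby how strongly the layer
             acts as a low-pass filter over the token signals.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)   # standard 1/sqrt(d) scaling
    attn = F.softmax(scores / tau, dim=-1)          # temperature-scaled softmax
    return attn @ v

# toy usage: one batch, one head, 8 tokens, 16-dim heads
q = torch.randn(1, 1, 8, 16)
k = torch.randn(1, 1, 8, 16)
v = torch.randn(1, 1, 8, 16)
out = temperature_scaled_attention(q, k, v, tau=0.7)
print(out.shape)  # torch.Size([1, 1, 8, 16])
```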
Related papers
- LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation [79.90766312484489]
Long Context Pre-training with Restoration Distillation (LongReD) distills the hidden states of selected layers from the original model on short texts.
Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance.
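A minimal sketch of the restoration-distillation idea described above, assuming an MSE objective between the student's and the frozen original (teacher) model's hidden states at selected layers; the function name and loss choice are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def restoration_distillation_loss(student_hiddens, teacher_hiddens, layer_ids):
    """MSE between student and frozen teacher hidden states at selected layers.

    student_hiddens, teacher_hiddens: lists of (batch, seq_len, dim) tensors,
    one per layer, produced on the same short-text batch.
    layer_ids: indices of the layers whose hidden states are distilled.
    """
    loss = 0.0
    for i in layer_ids:
        loss = loss + F.mse_loss(student_hiddens[i], teacher_hiddens[i].detach())
    return loss / len(layer_ids)

# toy usage with random "hidden states" from a 4-layer model
hs_student = [torch.randn(2, 32, 64, requires_grad=True) for _ in range(4)]
hs_teacher = [torch.randn(2, 32, 64) for _ in range(4)]
loss = restoration_distillation_loss(hs_student, hs_teacher, layer_ids=[1, 3])
loss.backward()
```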
arXiv Detail & Related papers (2025-02-11T08:37:16Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
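A rough sketch of similarity-based frame removal, assuming cosine similarity between each frame and the last kept frame with a hand-picked threshold; random tensors stand in for DINOv2 features, and LongVU's actual criterion may differ.

```python
import torch

def drop_redundant_frames(frame_feats, sim_threshold=0.95):
    """Keep a frame only if it is sufficiently different from the last kept frame.

    frame_feats: (num_frames, dim) per-frame features (e.g., DINOv2 CLS embeddings).
    Returns the indices of the retained frames.
    """
    feats = torch.nn.functional.normalize(frame_feats, dim=-1)
    kept = [0]
    for i in range(1, feats.size(0)):
        if torch.dot(feats[i], feats[kept[-1]]) < sim_threshold:
            kept.append(i)
    return kept

frames = torch.randn(100, 768)          # stand-in for DINOv2 frame features
print(drop_redundant_frames(frames))    # random features are dissimilar, so most frames are kept
```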
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- Length-Aware Multi-Kernel Transformer for Long Document Classification [4.796752450839119]
Lengthy documents pose a unique challenge to neural language models due to substantial memory consumption.
We propose a Length-Aware Multi-Kernel Transformer (LAMKIT) to address the new challenges of long document classification.
arXiv Detail & Related papers (2024-05-11T16:48:06Z)
- LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models by several folds.
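The sketch below shows standard RoPE position interpolation as one example of a training-free extension strategy: positions of a longer input are rescaled back into the trained position range. It is a generic illustration, not necessarily the exact variant evaluated in LongEmbed.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    """Rotary-embedding angles for the given (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions, inv_freq)          # (seq_len, dim/2)

def interpolated_positions(seq_len, train_len):
    """Position interpolation: squeeze seq_len positions into the trained range."""
    scale = min(1.0, train_len / seq_len)
    return torch.arange(seq_len).float() * scale

# extend a model trained on 512 positions to a 4096-token input
pos = interpolated_positions(seq_len=4096, train_len=512)
angles = rope_angles(pos, dim=64)
print(pos[-1], angles.shape)   # the last position stays within the trained range
```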
arXiv Detail & Related papers (2024-04-18T11:29:23Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
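A hypothetical sketch of input-adaptive FFN skipping, where a block's feed-forward sublayer is bypassed when the post-attention states have barely changed; the cosine-similarity "saturation" test and threshold are assumptions, not the paper's criterion.

```python
import torch
import torch.nn.functional as F

class SkippableFFNBlock(torch.nn.Module):
    """Transformer block whose FFN is skipped when the token states look saturated."""

    def __init__(self, dim, hidden, skip_threshold=0.999):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.GELU(), torch.nn.Linear(hidden, dim)
        )
        self.skip_threshold = skip_threshold

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        h = x + attn_out
        # input-adaptive skip: if attention barely changed the states, skip the FFN
        sim = F.cosine_similarity(h, x, dim=-1).mean()
        if sim > self.skip_threshold:
            return h
        return h + self.ffn(h)

block = SkippableFFNBlock(dim=64, hidden=256)
y = block(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```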
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
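A toy sketch of equal-information segmentation, using zlib purely as a stand-in for the neural compressor used in the paper: characters are accumulated into a window until its compressed size reaches a fixed bit budget, then a new window starts.

```python
import zlib

def equal_info_windows(text, bits_per_window=256):
    """Split text into windows whose compressed size is roughly bits_per_window.

    zlib is only a stand-in compressor; the paper compresses with a small
    language model and arithmetic coding.
    """
    windows, start = [], 0
    for end in range(1, len(text) + 1):
        compressed_bits = len(zlib.compress(text[start:end].encode())) * 8
        if compressed_bits >= bits_per_window:
            windows.append(text[start:end])
            start = end
    if start < len(text):
        windows.append(text[start:])
    return windows

sample = "Text embeddings enable various applications, but their performance " * 20
for w in equal_info_windows(sample)[:3]:
    print(len(w), len(zlib.compress(w.encode())) * 8)  # window length vs. compressed bits
```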
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- Streamlining Redundant Layers to Compress Large Language Models [21.27944103424621]
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs).
It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned.
Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
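A simple sketch of hidden-state-based layer scoring, assuming that layers whose outputs are nearly identical to their inputs (high cosine similarity) matter least and are candidates for pruning; this is a common heuristic and only an approximation of LLM-Streamline's actual importance metric.

```python
import torch
import torch.nn.functional as F

def layer_importance(hidden_states):
    """Score each layer by 1 - cos(input, output), averaged over tokens.

    hidden_states: list of (batch, seq_len, dim) tensors with the layer-0 input
    first, so hidden_states[i] -> hidden_states[i + 1] is the effect of layer i.
    Lower scores mean the layer changes the representation less.
    """
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        sim = F.cosine_similarity(h_in, h_out, dim=-1).mean()
        scores.append(1.0 - sim.item())
    return scores

states = [torch.randn(2, 16, 64) for _ in range(13)]   # toy hidden states of a 12-layer model
scores = layer_importance(states)
prune_candidates = sorted(range(len(scores)), key=lambda i: scores[i])[:3]
print(prune_candidates)  # the three layers that change hidden states the least
```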
arXiv Detail & Related papers (2024-03-28T04:12:13Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
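A short sketch of where the $O(L \log L)$ cost comes from: the sequence mixing of a state-space layer can be written as a long convolution and evaluated with FFTs. The kernel below is random rather than derived from actual SSM parameters, so this only illustrates the complexity argument.

```python
import torch

def fft_long_convolution(u, k):
    """Causal long convolution y = u * k via FFT, O(L log L) instead of O(L^2).

    u: (batch, L) input sequence; k: (L,) convolution kernel (e.g., the kernel
    implied by a state-space model's parameters).
    """
    L = u.size(-1)
    n = 2 * L                                   # zero-pad to avoid circular wrap-around
    U = torch.fft.rfft(u, n=n)
    K = torch.fft.rfft(k, n=n)
    return torch.fft.irfft(U * K, n=n)[..., :L]

u = torch.randn(4, 4096)
k = torch.randn(4096)
y = fft_long_convolution(u, k)
print(y.shape)   # torch.Size([4, 4096])
```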
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers [20.23085795744602]
We propose Pruning with Adaptive Sparsity Level (PALS) to automatically seek a decent balance between loss and sparsity.
PALS draws inspiration from sparse training and during-training methods.
It introduces the novel "expand" mechanism in training sparse neural networks, allowing the model to dynamically shrink, expand, or remain stable to find a proper sparsity level.
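An illustrative sketch of the shrink/expand mechanism, assuming magnitude-based dropping of active weights and random regrowth of pruned ones toward a target sparsity; PALS's actual rules for choosing and updating the sparsity level are not reproduced here.

```python
import torch

def adjust_sparsity(weight, mask, target_sparsity):
    """Shrink or expand the set of active weights toward target_sparsity.

    weight: dense parameter tensor; mask: 0/1 tensor of the same shape.
    Shrinking removes the smallest-magnitude active weights; expanding
    reactivates randomly chosen pruned positions (a simple regrowth rule).
    """
    numel = weight.numel()
    target_active = int(round((1.0 - target_sparsity) * numel))
    flat_mask = mask.flatten().clone()
    active_idx = flat_mask.nonzero().squeeze(1)
    pruned_idx = (flat_mask == 0).nonzero().squeeze(1)

    if active_idx.numel() > target_active:
        # shrink: drop the smallest-magnitude active weights
        mags = weight.flatten()[active_idx].abs()
        n_drop = active_idx.numel() - target_active
        flat_mask[active_idx[torch.argsort(mags)[:n_drop]]] = 0
    elif active_idx.numel() < target_active:
        # expand: reactivate randomly chosen pruned positions
        n_grow = min(target_active - active_idx.numel(), pruned_idx.numel())
        flat_mask[pruned_idx[torch.randperm(pruned_idx.numel())[:n_grow]]] = 1
    return flat_mask.view_as(mask)

w = torch.randn(64, 64)
m = torch.ones_like(w)
m = adjust_sparsity(w, m, target_sparsity=0.5)   # shrink to 50% sparsity
m = adjust_sparsity(w, m, target_sparsity=0.3)   # expand back to 30% sparsity
print(m.mean())   # fraction of active weights, ~0.7
```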
arXiv Detail & Related papers (2023-05-28T06:57:27Z)
- Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers [20.10172411803626]
We propose a compositional soft attention architecture that applies RoBERTa sentence-wise to extract plausible rationales at the token-level.
We find this method to significantly outperform Longformer-driven baselines on sentiment classification datasets.
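A compact sketch of the compositional soft-attention idea: each sentence is encoded independently (random tensors stand in for sentence-wise RoBERTa token states), a learned soft attention pools the tokens, and the attention weights double as token-level rationale scores. The class name and mean pooling of sentence vectors are illustrative choices.

```python
import torch
import torch.nn as nn

class SentenceSoftAttention(nn.Module):
    """Soft attention over one sentence's tokens; weights act as rationale scores."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, token_states):                 # (num_tokens, dim)
        weights = torch.softmax(self.score(token_states).squeeze(-1), dim=0)
        sentence_vec = weights @ token_states        # attention-weighted pooling
        return sentence_vec, weights                 # weights ~ token-level rationale

# placeholder for sentence-wise RoBERTa outputs: 3 sentences of different lengths
encoder_dim = 768
sentences = [torch.randn(n, encoder_dim) for n in (12, 7, 20)]
pool = SentenceSoftAttention(encoder_dim)
doc_vec = torch.stack([pool(s)[0] for s in sentences]).mean(0)   # simple document vector
print(doc_vec.shape, pool(sentences[0])[1].sum())                # weights sum to 1
```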
arXiv Detail & Related papers (2023-03-14T15:45:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.