Length-Induced Embedding Collapse in PLM-based Models
- URL: http://arxiv.org/abs/2410.24200v2
- Date: Tue, 10 Jun 2025 07:26:49 GMT
- Title: Length-Induced Embedding Collapse in PLM-based Models
- Authors: Yuqi Zhou, Sunhao Dai, Zhanshuo Cao, Xiao Zhang, Jun Xu
- Abstract summary: We introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. We investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon.
- Score: 7.127156731612495
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.
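As a rough illustration of the temperature-scaling idea, the sketch below applies an extra temperature to the attention logits, with a hypothetical length-dependent schedule; the exact scaling rule used by TempScale is defined in the paper and the linked repository.

```python
# Minimal sketch of temperature-scaled self-attention (illustrative only, not
# the authors' exact implementation).
import math
import torch
import torch.nn.functional as F

def temp_scaled_attention(q, k, v, temperature=1.0):
    """Scaled dot-product attention with an extra temperature on the logits.

    A temperature below 1 sharpens the attention distribution, which weakens
    the low-pass filtering (averaging) effect that grows with sequence length.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)
    attn = F.softmax(logits / temperature, dim=-1)
    return attn @ v

def length_temperature(seq_len, base_len=128, alpha=0.1):
    # Hypothetical schedule: sharpen more as the sequence grows.
    return 1.0 / (1.0 + alpha * math.log(max(seq_len / base_len, 1.0)))

q = k = v = torch.randn(1, 512, 64)  # batch, seq_len, head_dim
out = temp_scaled_attention(q, k, v, temperature=length_temperature(512))
```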
Related papers
- Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models [1.5817866616624976]
Large language models (LLMs) often struggle to accurately read and comprehend long texts.
Current methods for improvement typically rely on splitting long contexts into fixed-length chunks.
We propose a straightforward approach for dynamically separating and selecting chunks of long context.
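A minimal sketch of the general chunk-and-select pattern, assuming sentence-boundary splitting and a caller-supplied embedding function for scoring; the paper's actual separation and selection criteria differ in detail.

```python
# Illustrative sketch only: split a long context at sentence boundaries into
# variable-length chunks, then keep the chunks most similar to the query.
import re
import numpy as np

def dynamic_chunks(text, max_tokens=200):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if sum(len(s.split()) for s in current) >= max_tokens:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

def select_chunks(chunks, query, embed, top_k=3):
    # `embed` is a caller-supplied function returning a normalized vector.
    q = embed(query)
    scores = [float(np.dot(embed(c), q)) for c in chunks]
    order = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in sorted(order)]  # keep original document order
```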
arXiv Detail & Related papers (2025-06-01T01:42:40Z) - LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation [79.90766312484489]
We propose Long Context Pre-training with Restoration Distillation (LongReD), which distills the hidden states of selected layers from the original model on short texts.
Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance.
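A hedged sketch of a restoration-distillation term in this spirit: match selected layers of the context-extended (student) model to the original (teacher) model on short-text batches. The layer indices and the simple averaging are illustrative assumptions.

```python
# Sketch of a hidden-state distillation loss on short texts (assumed form).
import torch
import torch.nn.functional as F

def restoration_distill_loss(student_hidden, teacher_hidden, layers=(4, 8, 12)):
    """student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors,
    one per layer, computed on the same short-text batch."""
    loss = 0.0
    for l in layers:
        loss = loss + F.mse_loss(student_hidden[l], teacher_hidden[l].detach())
    return loss / len(layers)
```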
arXiv Detail & Related papers (2025-02-11T08:37:16Z) - What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
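To make the long-short contrastive idea concrete, a rough sketch: a token is treated as "key" when the full context lowers its loss markedly relative to a truncated context, and perplexity is computed over those tokens only. The threshold is an assumed placeholder, not the paper's calibration.

```python
# Sketch of a key-token perplexity based on a long-short context contrast.
import torch

def key_token_ppl(loss_long, loss_short, threshold=1.0):
    """loss_long / loss_short: per-token NLL tensors of shape [seq_len],
    computed with the full context vs. a truncated context."""
    key_mask = (loss_short - loss_long) > threshold  # context helped these tokens
    if key_mask.sum() == 0:
        return torch.exp(loss_long.mean())           # fall back to ordinary PPL
    return torch.exp(loss_long[key_mask].mean())
```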
arXiv Detail & Related papers (2024-10-31T09:39:28Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
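A simplified sketch of similarity-based frame pruning, assuming per-frame features from a vision encoder such as DINOv2 and an illustrative similarity threshold.

```python
# Drop a frame when its feature is nearly identical to the last kept frame.
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_features, threshold=0.95):
    """frame_features: [num_frames, dim] tensor of per-frame embeddings."""
    kept = [0]
    for i in range(1, frame_features.size(0)):
        sim = F.cosine_similarity(frame_features[i], frame_features[kept[-1]], dim=0)
        if sim < threshold:          # keep only frames that add new content
            kept.append(i)
    return kept
```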
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models [73.13933847198395]
We propose a training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding.
The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output.
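The map-reduce pattern itself can be sketched compactly; `call_llm` below is a hypothetical stand-in for whatever completion client is used, not a specific API.

```python
# Minimal map-reduce sketch over document chunks (illustrative only).
def map_reduce_answer(document, question, call_llm, chunk_size=4000):
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    # Map: answer the question against each chunk independently.
    partial = [call_llm(f"Context:\n{c}\n\nQuestion: {question}") for c in chunks]
    # Reduce: aggregate the intermediate answers into a final one.
    joined = "\n".join(f"- {p}" for p in partial)
    return call_llm(f"Combine these partial answers into one:\n{joined}\n\nQuestion: {question}")
```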
arXiv Detail & Related papers (2024-10-12T03:13:44Z) - Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models [5.330795983408874]
We introduce a novel method called late chunking, which leverages long-context embedding models to first embed all tokens of the long text.
The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks.
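A minimal sketch of the pooling step, assuming the contextualized token embeddings for the full text and the chunk spans are already available.

```python
# Late chunking: embed the whole text once, then pool per chunk afterwards.
import torch

def late_chunk_embeddings(token_embeddings, chunk_spans):
    """token_embeddings: [seq_len, dim] contextualized token vectors for the
    full text; chunk_spans: list of (start, end) token indices per chunk."""
    chunks = []
    for start, end in chunk_spans:
        chunks.append(token_embeddings[start:end].mean(dim=0))
    return torch.stack(chunks)  # [num_chunks, dim]; each vector sees the full context
```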
arXiv Detail & Related papers (2024-09-07T03:54:46Z) - Length-Aware Multi-Kernel Transformer for Long Document Classification [4.796752450839119]
Lengthy documents pose a unique challenge to neural language models due to substantial memory consumption.
We propose a Length-Aware Multi-Kernel Transformer (LAMKIT) to address the new challenges of long document classification.
arXiv Detail & Related papers (2024-05-11T16:48:06Z) - LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models by several folds.
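For reference, a compact sketch of position interpolation for rotary embeddings: positions of a long input are rescaled into the range seen at training time before the rotary angles are computed. Shapes follow the common RoPE convention; this is not the benchmark's own code.

```python
# Position interpolation for RoPE (illustrative sketch).
import torch

def rope_angles(seq_len, dim, train_len=512, base=10000.0):
    scale = min(1.0, train_len / seq_len)        # interpolate when seq_len > train_len
    positions = torch.arange(seq_len, dtype=torch.float32) * scale
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions, inv_freq)      # [seq_len, dim/2] rotation angles
```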
arXiv Detail & Related papers (2024-04-18T11:29:23Z) - Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
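A hedged sketch of input-adaptive feed-forward skipping: run the feed-forward block only when the hidden state changed enough within the layer. The cosine-similarity criterion and threshold are assumptions, not the exact FFN-SkipLLM rule.

```python
# Skip the FFN when token representations barely moved in this layer.
import torch
import torch.nn.functional as F

def maybe_skip_ffn(hidden_in, hidden_after_attn, ffn, threshold=0.99):
    sim = F.cosine_similarity(
        hidden_in.flatten(1), hidden_after_attn.flatten(1), dim=-1
    ).mean()
    if sim > threshold:              # representations barely changed
        return hidden_after_attn     # skip the feed-forward block
    return hidden_after_attn + ffn(hidden_after_attn)
```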
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z) - Streamlining Redundant Layers to Compress Large Language Models [21.27944103424621]
This paper introduces LLM-Streamline, a pioneering approach to layer pruning for large language models (LLMs).
It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned.
Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
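One simple way to operationalize "varying impacts on hidden states" is sketched below: layers whose output is nearly identical to their input are ranked as the most redundant pruning candidates. This scoring rule is an assumption in the spirit of the paper, not its exact procedure.

```python
# Rank layers by how little they change the hidden state (illustrative).
import torch
import torch.nn.functional as F

def layer_redundancy_scores(hidden_states):
    """hidden_states: list of [batch, seq, dim] tensors, where
    hidden_states[i+1] is the output of layer i."""
    scores = []
    for i in range(len(hidden_states) - 1):
        sim = F.cosine_similarity(
            hidden_states[i].flatten(1), hidden_states[i + 1].flatten(1), dim=-1
        ).mean().item()
        scores.append(sim)           # higher similarity => more redundant layer
    return scores
```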
arXiv Detail & Related papers (2024-03-28T04:12:13Z) - Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models [48.35385912526338]
This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs).
We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations.
We show that the degradation trend appears in every version of our dataset, although at different intensities.
arXiv Detail & Related papers (2024-02-19T16:04:53Z) - LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers [20.23085795744602]
We propose Adaptive Sparsity Level (PALS) to automatically seek a decent balance between loss and sparsity.
PALS draws inspiration from sparse training and during-training methods.
It introduces the novel "expand" mechanism in training sparse neural networks, allowing the model to dynamically shrink, expand, or remain stable to find a proper sparsity level.
arXiv Detail & Related papers (2023-05-28T06:57:27Z) - Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers [20.10172411803626]
We propose a compositional soft attention architecture that applies RoBERTa sentence-wise to extract plausible rationales at the token-level.
We find this method to significantly outperform Longformer-driven baselines on sentiment classification datasets.
arXiv Detail & Related papers (2023-03-14T15:45:35Z) - Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation [28.94748226472447]
We study the pros and cons of the standard transformer in document-level translation.
We propose a surprisingly simple long-short term masking self-attention on top of the standard transformer.
We can achieve a strong result in BLEU and capture discourse phenomena.
arXiv Detail & Related papers (2020-09-19T00:29:51Z) - Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection [10.265607222257263]
We propose a Controllable Time-delay Transformer (CT-Transformer) model that jointly completes the punctuation prediction and disfluency detection tasks in real time.
The proposed approach outperforms the previous state-of-the-art models on F-scores and achieves a competitive inference speed.
arXiv Detail & Related papers (2020-03-03T03:17:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.