Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks
- URL: http://arxiv.org/abs/2508.17744v2
- Date: Tue, 07 Oct 2025 13:43:18 GMT
- Title: Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks
- Authors: Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto
- Abstract summary: We study the surprising impact that truncating text embeddings has on downstream performance. We find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed.
- Score: 9.013194002835123
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. We consistently observe across 6 state-of-the-art text encoders and 26 downstream tasks, that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance, less than 10%, in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.
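The random truncation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper names `random_truncate` and `cosine_scores`, the 768-dimensional size, and the synthetic Gaussian "embeddings" are all assumptions made for illustration, and synthetic data says nothing about how real encoders behave. The sketch only shows the mechanics: pick a random subset of dimensions once, apply the same subset to queries and documents, and compare retrieval results before and after.

```python
import numpy as np


def random_truncate(embeddings, keep_fraction=0.5, seed=0):
    """Keep a random subset of dimensions (the same subset for every row).

    Returns the reduced matrix and the sorted indices of the kept dimensions.
    Hypothetical helper, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    dim = embeddings.shape[1]
    n_keep = max(1, int(round(dim * keep_fraction)))
    kept = np.sort(rng.choice(dim, size=n_keep, replace=False))
    return embeddings[:, kept], kept


def cosine_scores(queries, docs):
    """Cosine similarity between every query row and every document row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T


# Synthetic stand-ins for encoder outputs (assumption: 768 dimensions).
rng = np.random.default_rng(42)
docs = rng.normal(size=(100, 768))
# Each query is a lightly perturbed copy of one of the first 10 documents.
queries = docs[:10] + 0.1 * rng.normal(size=(10, 768))

# Drop 50% of the dimensions, using the same indices for docs and queries.
docs_half, kept = random_truncate(docs, keep_fraction=0.5, seed=0)
queries_half = queries[:, kept]

full_top1 = cosine_scores(queries, docs).argmax(axis=1)
half_top1 = cosine_scores(queries_half, docs_half).argmax(axis=1)

agreement = float((full_top1 == half_top1).mean())
print(f"kept {docs_half.shape[1]} of {docs.shape[1]} dimensions")
print(f"top-1 agreement between full and truncated embeddings: {agreement:.2f}")
```

The key design point is that the kept indices are sampled once and reused for both sides of the comparison; sampling separately for queries and documents would compare unrelated coordinates.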
Related papers
- The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure [91.01653854955286]
Embedding-based similarity metrics can be influenced by spurious attributes like the text's source or language. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost.
arXiv Detail & Related papers (2025-07-01T23:17:12Z) - Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings [9.879314903531286]
Prompt-based text embedding models generate task-specific embeddings upon receiving tailored prompts. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation. For classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality, the performance degradation is very small.
arXiv Detail & Related papers (2025-06-02T08:50:38Z) - When Dimensionality Hurts: The Role of LLM Embedding Compression for Noisy Regression Tasks [17.109522466982476]
We show that compressed representations of text can yield better performance in regression tasks. Our results suggest that the success of interpretable compressed representations such as sentiment may be due to a regularising effect.
arXiv Detail & Related papers (2025-02-04T10:23:11Z) - Static Pruning in Dense Retrieval using Matrix Decomposition [12.899105656025018]
In the era of dense retrieval, document indexing and retrieval are largely based on encoding models that transform text documents into embeddings. Recent studies have shown that it is possible to reduce embedding size without sacrificing, and in some cases while improving, retrieval effectiveness. We present a novel static pruning method for reducing the dimensionality of embeddings using Principal Components Analysis.
arXiv Detail & Related papers (2024-12-13T09:09:20Z) - On the Dimensionality of Sentence Embeddings [56.86742006079451]
We show that the optimal dimension of sentence embeddings is usually smaller than the default value.
We propose a two-step training method for sentence representation learning models, wherein the encoder and the pooler are optimized separately to mitigate the overall performance loss.
arXiv Detail & Related papers (2023-10-23T18:51:00Z) - Rediscovering Hashed Random Projections for Efficient Quantization of Contextualized Sentence Embeddings [113.38884267189871]
Training and inference on edge devices often require an efficient setup due to computational limitations.
Pre-computing data representations and caching them on a server can mitigate extensive edge device computation.
We propose a simple yet effective approach that uses random hyperplane projections.
We show that the embeddings remain effective for training models across various English and German sentence classification tasks that retain 94%--99% of their floating-point.
arXiv Detail & Related papers (2023-03-13T10:53:00Z) - The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes [61.78092651347371]
We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as index size increases.
In extreme cases, this can even lead to a tipping point where at a certain index size sparse representations outperform dense representations.
arXiv Detail & Related papers (2020-12-28T12:25:25Z) - Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z) - Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and have demonstrated promising results on text classification.
Despite this success, their performance could be largely jeopardized in practice since they are unable to capture high-order interactions between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)