Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings
- URL: http://arxiv.org/abs/2506.01435v1
- Date: Mon, 02 Jun 2025 08:50:38 GMT
- Title: Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings
- Authors: Hayato Tsukagoshi, Ryohei Sasano
- Abstract summary: Prompt-based text embedding models generate task-specific embeddings upon receiving tailored prompts. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation. For classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality, the performance degradation is very small.
- Score: 9.879314903531286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs of embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality, the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which are considered to have very high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy compared with those for retrieval and STS.
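The naive reduction described above amounts to keeping only the leading dimensions of each embedding. The sketch below illustrates this together with a participation-ratio proxy for effective dimensionality; the synthetic data, the 1024-dimensional setting, and the choice of participation ratio as the redundancy measure are assumptions for illustration, not the paper's exact estimators.

```python
# Minimal sketch: truncate prompt-based embeddings to their leading dimensions
# and compare a participation-ratio proxy for effective dimensionality.
# Synthetic data and the participation-ratio proxy are illustrative assumptions;
# the paper's intrinsic-dimensionality and isotropy estimators may differ.
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the leading fraction of dimensions, then L2-normalize each row."""
    k = max(1, int(embeddings.shape[1] * keep_ratio))
    reduced = embeddings[:, :k]
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)

def participation_ratio(embeddings: np.ndarray) -> float:
    """Effective dimensionality: (sum of covariance eigenvalues)^2 divided by the
    sum of squared eigenvalues; values far below the ambient dimension indicate redundancy."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(centered, rowvar=False)), 0.0, None)
    return float(eigvals.sum() ** 2 / ((eigvals ** 2).sum() + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = rng.normal(size=(2000, 1024))          # stand-in for prompt-based embeddings
    quarter = truncate_and_normalize(full, 0.25)  # keep the first 25% of dimensions
    print(participation_ratio(full), participation_ratio(quarter))
```

For isotropic synthetic data the participation ratio is a sizeable fraction of the ambient dimension; for real prompt-based embeddings, a much smaller value would reflect the redundancy the paper reports.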
Related papers
- The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure [91.01653854955286]
Embedding-based similarity metrics can be influenced by spurious attributes like the text's source or language. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost.
arXiv Detail & Related papers (2025-07-01T23:17:12Z)
- Static Pruning in Dense Retrieval using Matrix Decomposition [12.899105656025018]
In the era of dense retrieval, document indexing and retrieval are largely based on encoding models that transform text documents into embeddings. Recent studies have shown that it is possible to reduce embedding size without sacrificing - and in some cases improving - the retrieval effectiveness. We present a novel static pruning method for reducing the dimensionality of embeddings using Principal Components Analysis.
arXiv Detail & Related papers (2024-12-13T09:09:20Z)
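As a rough companion to the PCA-based pruning in the entry above, a generic post-hoc PCA reduction of document embeddings might look like the following; this is a sketch of PCA truncation in general, with scikit-learn assumed as a dependency, not the cited paper's exact static pruning procedure.

```python
# A minimal, generic PCA reduction of document embeddings; an illustration of
# PCA-based dimensionality reduction, not the cited paper's specific method.
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(doc_embeddings: np.ndarray, query_embeddings: np.ndarray, n_components: int):
    """Fit PCA on document embeddings, then project both documents and queries."""
    pca = PCA(n_components=n_components)
    docs_reduced = pca.fit_transform(doc_embeddings)
    queries_reduced = pca.transform(query_embeddings)
    return docs_reduced, queries_reduced

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.normal(size=(2000, 768))   # stand-in for encoder output embeddings
    queries = rng.normal(size=(10, 768))
    d_red, q_red = pca_reduce(docs, queries, n_components=128)
    print(d_red.shape, q_red.shape)        # (2000, 128) (10, 128)
```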
- Anti-Collapse Loss for Deep Metric Learning Based on Coding Rate Metric [99.19559537966538]
Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval.
To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss.
Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-07-03T13:44:20Z)
- Embedding Compression for Efficient Re-Identification [0.0]
ReID algorithms aim to map new observations of an object to previously recorded instances.
We benchmark quantization-aware training along with three different dimension reduction methods.
We find that ReID embeddings can be compressed by up to 96x with minimal drop in performance.
arXiv Detail & Related papers (2024-05-23T15:57:11Z)
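The compression factors in the entry above combine dimensionality reduction with quantization. As a hedged illustration of the quantization half only, the sketch below applies generic post-hoc per-dimension int8 scalar quantization (roughly 4x on its own); it is not the cited quantization-aware training.

```python
# Generic post-hoc int8 scalar quantization of embeddings; per-dimension
# min/max scaling is an assumption made for illustration, not the cited method.
import numpy as np

def quantize_int8(embeddings: np.ndarray):
    """Map each dimension linearly to int8; return codes plus scale/offset for decoding."""
    lo = embeddings.min(axis=0, keepdims=True)
    hi = embeddings.max(axis=0, keepdims=True)
    scale = (hi - lo) / 255.0 + 1e-12
    codes = np.round((embeddings - lo) / scale - 128.0).astype(np.int8)
    return codes, scale, lo

def dequantize_int8(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) + 128.0) * scale + lo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 512)).astype(np.float32)
    codes, scale, lo = quantize_int8(emb)
    recon = dequantize_int8(codes, scale, lo)
    print(np.abs(emb - recon).max())  # reconstruction error stays small relative to the value range
```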
- Evaluating Unsupervised Dimensionality Reduction Methods for Pretrained Sentence Embeddings [28.35953315232521]
Sentence embeddings produced by Pretrained Language Models (PLMs) have received wide attention from the NLP community.
High dimensionality of the sentence embeddings produced by PLMs is problematic when representing large numbers of sentences in memory- or compute-constrained devices.
We evaluate unsupervised dimensionality reduction methods to reduce the dimensionality of sentence embeddings produced by PLMs.
arXiv Detail & Related papers (2024-03-20T21:58:32Z)
- On the Dimensionality of Sentence Embeddings [56.86742006079451]
We show that the optimal dimension of sentence embeddings is usually smaller than the default value.
We propose a two-step training method for sentence representation learning models, wherein the encoder and the pooler are optimized separately to mitigate the overall performance loss.
arXiv Detail & Related papers (2023-10-23T18:51:00Z)
- An evaluation framework for dimensionality reduction through sectional curvature [59.40521061783166]
In this work, we aim to introduce the first highly non-supervised dimensionality reduction performance metric.
To test its feasibility, this metric has been used to evaluate the performance of the most commonly used dimension reduction algorithms.
A new parameterized problem instance generator has been constructed in the form of a function generator.
arXiv Detail & Related papers (2023-03-17T11:59:33Z)
- DimenFix: A novel meta-dimensionality reduction method for feature preservation [64.0476282000118]
We propose a novel meta-method, DimenFix, which can be operated upon any base dimensionality reduction method that involves a gradient-descent-like process.
By allowing users to define the importance of different features, which is considered in dimensionality reduction, DimenFix creates new possibilities to visualize and understand a given dataset.
arXiv Detail & Related papers (2022-11-30T05:35:22Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
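The entry above targets discrete metrics; as a rough continuous-space counterpart, the sketch below implements the widely used two-nearest-neighbor (TwoNN) intrinsic-dimension estimator based on nearest-neighbor distance ratios, with scikit-learn assumed as a dependency. It is not the cited paper's discrete-metric algorithm.

```python
# Minimal TwoNN-style intrinsic dimension estimate for continuous embeddings:
# the ratio mu = r2/r1 of second- to first-nearest-neighbor distances follows a
# Pareto law whose shape parameter is the intrinsic dimension, so the maximum
# likelihood estimate is N / sum(log mu). Generic sketch, not the cited method.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_intrinsic_dimension(points: np.ndarray) -> float:
    # n_neighbors=3 because the first neighbor returned is the point itself
    dists, _ = NearestNeighbors(n_neighbors=3).fit(points).kneighbors(points)
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = r2 / np.clip(r1, 1e-12, None)
    mu = mu[mu > 1.0]  # drop degenerate pairs with tied distances
    return float(len(mu) / np.sum(np.log(mu)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(3000, 5))             # 5-dimensional latent data
    ambient = latent @ rng.normal(size=(5, 50))     # linearly embedded in 50 dimensions
    print(twonn_intrinsic_dimension(ambient))       # expected to be close to 5
```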
- Exploring Dimensionality Reduction Techniques in Multilingual Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensionality reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers.
It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively.
arXiv Detail & Related papers (2022-04-18T17:20:55Z)
- Dimensionality Reduction for Sentiment Classification: Evolving for the Most Prominent and Separable Features [4.156782836736784]
In sentiment classification, the enormous amount of textual data, its immense dimensionality, and inherent noise make it extremely difficult for machine learning classifiers to extract high-level and complex abstractions.
In the existing dimensionality reduction techniques, the number of components needs to be set manually which results in loss of the most prominent features.
We propose a new framework consisting of two dimensionality reduction techniques: Sentiment Term Presence Count (SentiTPC) and Sentiment Term Presence Ratio (SentiTPR).
arXiv Detail & Related papers (2020-06-01T09:46:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.