Related papers: The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

URL: http://arxiv.org/abs/2012.14210v1
Date: Mon, 28 Dec 2020 12:25:25 GMT
Title: The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
Authors: Nils Reimers and Iryna Gurevych
Abstract summary: We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as index size increases. In extreme cases, this can even lead to a tipping point where, at a certain index size, sparse representations outperform dense representations.
Score: 61.78092651347371
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Information retrieval using dense low-dimensional representations has recently become popular and has outperformed traditional sparse representations such as BM25. However, no previous work has investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as index size increases. In extreme cases, this can even lead to a tipping point where, at a certain index size, sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: the lower the dimension, the higher the chance of false positives, i.e. of returning irrelevant documents.
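
The link between dimensionality, index size, and false positives can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes that irrelevant documents behave like random unit vectors, and the helper names (`unit_vectors`, `mean_top_distractor_score`), the dimensions, and the index sizes are all hypothetical choices made for illustration. It measures how high the best cosine similarity between a query and purely random "distractor" vectors gets. The higher that score, the easier it is for an irrelevant document to outrank a relevant one.

```python
import numpy as np


def unit_vectors(n, dim, rng):
    """Draw n random vectors of the given dimension and scale them to unit length."""
    v = rng.standard_normal((n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def mean_top_distractor_score(dim, index_sizes, n_queries=20, seed=0):
    """For each index size, return the best random cosine score, averaged over queries.

    Assumption: irrelevant documents are modeled as random unit vectors.
    Smaller indexes are prefixes of the largest one, so the score can only
    stay equal or grow as the index grows.
    """
    rng = np.random.default_rng(seed)
    queries = unit_vectors(n_queries, dim, rng)
    docs = unit_vectors(max(index_sizes), dim, rng)
    sims = docs @ queries.T  # cosine similarities, shape (n_docs, n_queries)
    return {n: float(sims[:n].max(axis=0).mean()) for n in index_sizes}


index_sizes = (1_000, 20_000)
scores = {dim: mean_top_distractor_score(dim, index_sizes) for dim in (16, 64, 256)}

for dim, by_size in scores.items():
    for n, score in by_size.items():
        print(f"dim={dim:4d}  index={n:6d}  mean best random cosine={score:.3f}")
```

Under these assumptions, the best distractor score rises with index size and is markedly higher in low dimensions, which matches the qualitative claim in the abstract. Real embeddings are not uniformly random, so the numbers indicate a trend and should not be read as a prediction for any particular model.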

Related papers

Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks [9.013194002835123]
We study the surprising impact that truncating text embeddings has on downstream performance. We find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed.
arXiv Detail & Related papers (2025-08-25T07:37:24Z)
Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings [9.879314903531286]
Prompt-based text embedding models generate task-specific embeddings upon receiving tailored prompts. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation. For classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality the performance degradation is very small.
arXiv Detail & Related papers (2025-06-02T08:50:38Z)
Answering Multimodal Exclusion Queries with Lightweight Sparse Disentangled Representations [20.355669581029396]
Multimodal representations that enable cross-modal retrieval are widely used. These often lack interpretability making it difficult to explain the retrieved results. We propose an approach that generates smaller dimensionality fixed-size embeddings that are disentangled.
arXiv Detail & Related papers (2025-04-04T05:23:45Z)
Towards Scalable Semantic Representation for Recommendation [65.06144407288127]
Mixture-of-Codes is proposed to construct semantic IDs based on large language models (LLMs). Our method achieves superior discriminability and dimension robustness scalability, leading to the best scale-up performance in recommendations.
arXiv Detail & Related papers (2024-10-12T15:10:56Z)
Useful Compact Representations for Data-Fitting [0.0]
We develop new compact representations that are parameterized by a choice of vectors and that reduce to existing well known formulas for special choices. We demonstrate effectiveness of the compact representations for large eigenvalue computations, tensor factorizations and nonlinear regressions.
arXiv Detail & Related papers (2024-03-18T19:43:00Z)
Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning [43.29587373211267]
In representation learning, a disentangled representation is highly desirable as it encodes generative factors of data in a separable and compact pattern. This paper further investigates the necessity of disentangled representation in downstream applications.
arXiv Detail & Related papers (2024-03-01T08:31:58Z)
Implications of sparsity and high triangle density for graph representation learning [67.98498239263549]
Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold.
arXiv Detail & Related papers (2022-10-27T09:15:15Z)
Learning-Based Dimensionality Reduction for Computing Compact and Effective Local Feature Descriptors [101.62384271200169]
A distinctive representation of image patches in form of features is a key component of many computer vision and robotics tasks. We investigate multi-layer perceptrons (MLPs) to extract low-dimensional but high-quality descriptors. We consider different applications, including visual localization, patch verification, image matching and retrieval.
arXiv Detail & Related papers (2022-09-27T17:59:04Z)
"Why Here and Not There?" -- Diverse Contrasting Explanations of Dimensionality Reduction [75.97774982432976]
We introduce the concept of contrasting explanations for dimensionality reduction. We apply a realization of this concept to the specific application of explaining two dimensional data visualization.
arXiv Detail & Related papers (2022-06-15T08:54:39Z)
Compressibility of Distributed Document Representations [0.0]
CoRe is a representation learner-agnostic framework suitable for representation compression. We show CoRe's behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Results based on more than 100,000 compression experiments indicate that CoRe offers a very good trade-off between the compression efficiency and performance.
arXiv Detail & Related papers (2021-10-14T17:56:35Z)
On Single and Multiple Representations in Dense Passage Retrieval [30.303705563808386]
Two dense retrieval families have become apparent: single representation and multiple representation. This paper contributes a direct study on their comparative effectiveness, noting situations where each method under/over performs w.r.t. each other, and w.r.t. a BM25 baseline. We also show that multiple representations obtain better improvements than single representations for queries that are the hardest for BM25, as well as for definitional queries.
arXiv Detail & Related papers (2021-08-13T15:01:53Z)
Minimizing FLOPs to Learn Efficient Sparse Representations [36.24540913526988]
We learn high dimensional and sparse representations that have similar representational capacity as dense embeddings. Our approach is competitive to the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets.
arXiv Detail & Related papers (2020-04-12T18:09:02Z)
NCVis: Noise Contrastive Approach for Scalable Visualization [79.44177623781043]
NCVis is a high-performance dimensionality reduction method built on a sound statistical basis of noise contrastive estimation. We show that NCVis outperforms state-of-the-art techniques in terms of speed while preserving the representation quality of other methods.
arXiv Detail & Related papers (2020-01-30T15:43:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.