Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval
- URL: http://arxiv.org/abs/2210.05521v3
- Date: Tue, 17 Oct 2023 07:25:38 GMT
- Title: Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval
- Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao
- Abstract summary: Inverted file structure is a common technique for accelerating dense retrieval.
In this work, we present the Hybrid Inverted Index (HI$^2$), where embedding clusters and salient terms work collaboratively to accelerate dense retrieval.
- Score: 25.402767809863946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inverted file structure is a common technique for accelerating dense
retrieval. It clusters documents based on their embeddings; during searching,
it probes nearby clusters w.r.t. an input query and only evaluates documents
within them by subsequent codecs, thus avoiding the expensive cost of
exhaustive traversal. However, the clustering is always lossy, which causes
relevant documents to be missed in the probed clusters and hence degrades
retrieval quality. In contrast, lexical matching, such as the overlap of salient
terms, tends to be a strong feature for identifying relevant documents. In this
work, we present the Hybrid Inverted Index (HI$^2$), where the embedding
clusters and salient terms work collaboratively to accelerate dense retrieval.
To make the best of both effectiveness and efficiency, we devise a cluster selector
and a term selector to construct compact inverted lists and to search through them
efficiently. Moreover, we leverage simple unsupervised algorithms as
well as end-to-end knowledge distillation to learn these two modules, with the
latter further boosting the effectiveness. Based on comprehensive experiments
on popular retrieval benchmarks, we verify that clusters and terms indeed
complement each other, enabling HI$^2$ to achieve lossless retrieval quality
with competitive efficiency across various index settings. Our code and
checkpoint are publicly available at
https://github.com/namespace-Pt/Adon/tree/HI2.
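As a rough illustration of the core idea (a minimal sketch, not the authors' implementation; names such as HybridInvertedIndex and nprobe, and the toy random-centroid initialization, are assumptions), the snippet below gathers candidates from both the probed embedding clusters and the query's salient-term posting lists, then scores only that candidate union exactly:

```python
# Minimal sketch of a hybrid inverted index in the spirit of HI^2 (illustrative only).
import numpy as np
from collections import defaultdict

class HybridInvertedIndex:
    def __init__(self, doc_embs, doc_terms, n_clusters=8, seed=0):
        self.doc_embs = doc_embs                               # (N, d) document embeddings
        rng = np.random.default_rng(seed)
        # Toy centroids: random documents stand in for a trained cluster selector.
        self.centroids = doc_embs[rng.choice(len(doc_embs), n_clusters, replace=False)]
        assign = np.argmax(doc_embs @ self.centroids.T, axis=1)
        self.cluster_lists = defaultdict(list)                 # cluster id -> doc ids
        for doc_id, c in enumerate(assign):
            self.cluster_lists[int(c)].append(doc_id)
        self.term_lists = defaultdict(list)                    # salient term -> doc ids
        for doc_id, terms in enumerate(doc_terms):
            for t in terms:
                self.term_lists[t].append(doc_id)

    def search(self, q_emb, q_terms, nprobe=2, topk=5):
        # (a) probe the nprobe clusters nearest to the query embedding
        probed = np.argsort(-(self.centroids @ q_emb))[:nprobe]
        cands = set()
        for c in probed:
            cands.update(self.cluster_lists[int(c)])
        # (b) union with documents that share a salient term with the query
        for t in q_terms:
            cands.update(self.term_lists.get(t, []))
        # exact scoring is restricted to the candidate union
        cands = np.fromiter(cands, dtype=int)
        scores = self.doc_embs[cands] @ q_emb
        order = np.argsort(-scores)[:topk]
        return cands[order], scores[order]
```

In HI$^2$ itself, the cluster selector and term selector are learned (via unsupervised initialization and end-to-end knowledge distillation) rather than the random centroids and raw term overlap used in this sketch.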
Related papers
- LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors [37.64658206917278]
LexBoost builds a network of dense neighbors (a corpus graph) using a dense retrieval approach while indexing.
We consider both a document's lexical relevance score and its neighbors' scores to rank the documents.
We show that re-ranking on top of LexBoost outperforms traditional dense re-ranking and leads to results comparable with higher-latency exhaustive dense retrieval.
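One plausible reading of this scoring scheme (a hedged sketch with assumed names and an assumed linear interpolation, not the paper's exact formula) blends a document's own lexical score with the lexical scores of its precomputed dense neighbors:

```python
# Hypothetical LexBoost-style re-scoring: mix a document's lexical score with the
# lexical scores of its dense-retrieval neighbors from an offline corpus graph.
def lexboost_score(doc_id, lexical_scores, corpus_graph, lam=0.5):
    # lexical_scores: dict doc_id -> lexical (e.g., BM25) score for the current query
    # corpus_graph:   dict doc_id -> list of nearest-neighbor doc ids (built with a dense model)
    own = lexical_scores.get(doc_id, 0.0)
    neighbors = corpus_graph.get(doc_id, [])
    if not neighbors:
        return own
    neigh = sum(lexical_scores.get(n, 0.0) for n in neighbors) / len(neighbors)
    return lam * own + (1.0 - lam) * neigh
```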
arXiv Detail & Related papers (2024-08-25T18:11:37Z) - Early Exit Strategies for Approximate k-NN Search in Dense Retrieval [10.48678957367324]
We build upon the state of the art for early exit A-kNN and propose an unsupervised method based on the notion of patience.
We show that our techniques improve the A-kNN efficiency with up to 5x speedups while achieving negligible effectiveness losses.
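A minimal sketch of what a patience-based early exit could look like for cluster probing (an assumed reading, not the paper's method; all names are hypothetical): stop visiting further clusters once the running top-k has been stable for `patience` consecutive clusters.

```python
# Hypothetical patience-based early exit for IVF-style probing (illustrative sketch only).
import heapq
import numpy as np

def search_with_patience(q_emb, centroids, cluster_lists, doc_embs, topk=10, patience=2):
    order = np.argsort(-(centroids @ q_emb))     # clusters from nearest to farthest
    heap, stale = [], 0                          # min-heap of (score, doc_id) for the top-k
    for c in order:
        changed = False
        for doc_id in cluster_lists[int(c)]:
            s = float(doc_embs[doc_id] @ q_emb)
            if len(heap) < topk:
                heapq.heappush(heap, (s, doc_id)); changed = True
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc_id)); changed = True
        stale = 0 if changed else stale + 1
        if stale >= patience:                    # top-k unchanged for `patience` clusters: stop
            break
    return sorted(heap, reverse=True)
```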
arXiv Detail & Related papers (2024-08-09T10:17:07Z) - ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
ABCDE aims to be practical: it allows items to have application-specific importance values, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items.
The approach to measuring the delta in clustering quality is novel: instead of constructing an expensive ground truth up front and evaluating each clustering with respect to it, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings.
arXiv Detail & Related papers (2024-07-31T08:29:35Z) - SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z) - DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
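A very rough sketch of the lookup side (an assumed flow; `decode_watermark` is a hypothetical stand-in for the error-controlled key decoder, not the paper's API): the decoded key narrows the search to one cluster, and exact embedding similarity is computed only inside it.

```python
# Hypothetical DREW-style lookup: watermark key -> cluster, then similarity inside the cluster.
import numpy as np

def drew_search(query_emb, query_image, cluster_lists, doc_embs, decode_watermark, topk=5):
    cluster_id = decode_watermark(query_image)   # recover the cluster key from the watermark
    if cluster_id is None:                       # fall back to a full scan if decoding fails
        cand = np.arange(len(doc_embs))
    else:
        cand = np.asarray(cluster_lists[cluster_id])
    scores = doc_embs[cand] @ query_emb
    top = np.argsort(-scores)[:topk]
    return cand[top], scores[top]
```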
arXiv Detail & Related papers (2024-06-05T01:19:44Z) - Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner.
Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval.
We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance of a document to a query.
arXiv Detail & Related papers (2024-03-31T13:29:43Z) - Lexically-Accelerated Dense Retrieval [29.327878974130055]
LADR (Lexically-Accelerated Dense Retrieval) is a simple yet effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z) - HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image
Retrieval [20.53316810731414]
We propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$^2$ Loss).
The proposed HyP$2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs.
arXiv Detail & Related papers (2022-08-14T15:06:27Z) - Overcomplete Deep Subspace Clustering Networks [80.16644725886968]
Experimental results on four benchmark datasets show the effectiveness of the proposed method over DSC and other clustering methods in terms of clustering error.
Our method is also less dependent than DSC on where pre-training is stopped to obtain the best performance, and it is more robust to noise.
arXiv Detail & Related papers (2020-11-16T22:07:18Z) - Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)