Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval
- URL: http://arxiv.org/abs/2210.05521v3
- Date: Tue, 17 Oct 2023 07:25:38 GMT
- Title: Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval
- Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao
- Abstract summary: Inverted file structure is a common technique for accelerating dense retrieval.
In this work, we present the Hybrid Inverted Index (HI$^2$), where the embedding clusters and salient terms work collaboratively to accelerate dense retrieval.
- Score: 25.402767809863946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inverted file structure is a common technique for accelerating dense
retrieval. It clusters documents based on their embeddings; during searching,
it probes nearby clusters w.r.t. an input query and only evaluates documents
within them by subsequent codecs, thus avoiding the expensive cost of
exhaustive traversal. However, the clustering is always lossy: relevant
documents can fall outside the probed clusters, which degrades retrieval
quality. In contrast, lexical matching, such as the overlap of salient terms,
tends to be a strong feature for identifying relevant documents. In this
work, we present the Hybrid Inverted Index (HI$^2$), where the embedding
clusters and salient terms work collaboratively to accelerate dense retrieval.
To make the best of both effectiveness and efficiency, we devise a cluster
selector and a term selector to construct compact inverted lists and to search
through them efficiently. Moreover, we leverage simple unsupervised algorithms as
well as end-to-end knowledge distillation to learn these two modules, with the
latter further boosting the effectiveness. Based on comprehensive experiments
on popular retrieval benchmarks, we verify that clusters and terms indeed
complement each other, enabling HI$^2$ to achieve lossless retrieval quality
with competitive efficiency across various index settings. Our code and
checkpoint are publicly available at
https://github.com/namespace-Pt/Adon/tree/HI2.
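The probe-then-score loop the abstract describes can be sketched as a plain inverted-file (IVF) index, without HI$^2$'s term lists or learned selectors; all names, sizes, and the random-centroid clustering below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus of document embeddings (n_docs x dim).
docs = rng.normal(size=(1000, 32)).astype(np.float32)

# Index construction: cluster documents by embedding. For brevity we pick
# random documents as centroids instead of running full k-means.
n_clusters = 16
centroids = docs[rng.choice(len(docs), n_clusters, replace=False)]
assignments = np.argmin(((docs[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

def ivf_search(query, nprobe=4, k=10):
    """Probe the `nprobe` nearest clusters, then score only their documents."""
    # Rank clusters by centroid distance to the query.
    cluster_order = np.argsort(((centroids - query) ** 2).sum(-1))
    candidates = np.concatenate([inverted_lists[c] for c in cluster_order[:nprobe]])
    # Exhaustive scoring restricted to the candidate set; the "subsequent
    # codecs" of the abstract would typically be a compressed scorer here.
    scores = docs[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

query = rng.normal(size=32).astype(np.float32)
top = ivf_search(query)
```

The lossiness the abstract points out is visible here: any relevant document assigned to a cluster outside the `nprobe` probed ones can never be returned, regardless of its true score.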
Related papers
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z) - DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW)
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
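The DREW pipeline above can be sketched as follows; the error-controlled watermark encoding and decoding is abstracted to a plain key lookup, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy reference dataset; embeddings are L2-normalized so dot product = cosine.
refs = rng.normal(size=(500, 16)).astype(np.float32)
refs /= np.linalg.norm(refs, axis=1, keepdims=True)

# Randomly cluster the reference set and give each cluster a unique key,
# standing in for DREW's error-controlled watermark keys.
n_clusters = 8
cluster_of = rng.integers(0, n_clusters, size=len(refs))
members = {k: np.where(cluster_of == k)[0] for k in range(n_clusters)}

def retrieve(query_vec, recovered_key):
    """After the watermark key has located the right cluster, run exact
    embedding similarity retrieval inside that cluster only."""
    ids = members[recovered_key]
    sims = refs[ids] @ query_vec
    return int(ids[np.argmax(sims)])

# A query equal to a known reference item finds that item once its cluster
# key is recovered (the decoding step itself is abstracted away here).
target = 42
match = retrieve(refs[target], int(cluster_of[target]))
```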
arXiv Detail & Related papers (2024-06-05T01:19:44Z) - Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner.
Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval.
We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance of a document to a query.
arXiv Detail & Related papers (2024-03-31T13:29:43Z) - Lexically-Accelerated Dense Retrieval [29.327878974130055]
'LADR' (Lexically-Accelerated Dense Retrieval) is a simple-yet-effective approach that improves the efficiency of existing dense retrieval models.
LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.
arXiv Detail & Related papers (2023-07-31T15:44:26Z) - Precise Zero-Shot Dense Retrieval without Relevance Labels [60.457378374671656]
Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system.
We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
arXiv Detail & Related papers (2022-12-20T18:09:52Z) - Genie: A new, fast, and outlier-resistant hierarchical clustering
algorithm [3.7491936479803054]
We propose a new hierarchical clustering linkage criterion called Genie.
Our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini index) does not increase drastically above a given threshold.
A reference implementation of the algorithm has been included in the open source 'genie' package for R.
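The merge rule described above can be sketched as plain single-linkage agglomeration with one twist: once the Gini index of the cluster sizes exceeds the threshold, the next merge must involve a smallest cluster. This is a simplified illustration, not the optimized reference implementation from the R 'genie' package:

```python
import numpy as np

def gini(sizes):
    """Gini index of cluster sizes: 0 = perfectly balanced."""
    s = np.sort(np.asarray(sizes, dtype=float))
    n = len(s)
    ranks = np.arange(1, n + 1)
    return (2 * (ranks * s).sum() / (n * s.sum())) - (n + 1) / n

def genie_merge_order(points, g_threshold=0.3):
    """Single-linkage agglomeration with Genie's equity constraint: once the
    size distribution gets too skewed, the smallest cluster must take part
    in the next merge."""
    D = np.linalg.norm(points[:, None] - points[None], axis=-1)
    clusters = [[i] for i in range(len(points))]
    merges = []

    def link(a, b):  # single-linkage distance between clusters a and b
        return min(D[i, j] for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        sizes = [len(c) for c in clusters]
        if gini(sizes) > g_threshold:
            s = min(range(len(clusters)), key=lambda i: sizes[i])
            pairs = [(s, j) for j in range(len(clusters)) if j != s]
        else:
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: link(*p))
        merges.append((min(clusters[a]), min(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Two tight pairs of points: the closest pairs merge first.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
order = genie_merge_order(pts)
```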
arXiv Detail & Related papers (2022-09-13T06:42:53Z) - HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image
Retrieval [20.53316810731414]
We propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$^2$ Loss).
The proposed HyP$2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs.
arXiv Detail & Related papers (2022-08-14T15:06:27Z) - A Learned Index for Exact Similarity Search in Metric Spaces [25.330353637669386]
LIMS is proposed to use data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
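A minimal sketch of the LIMS idea under heavy simplification (one cluster, one pivot, a linear model plus an exact error bound; the real system uses multiple clusters and pivots over disk-resident data, and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.uniform(size=(200, 4))

# Pivot-based transform: map each record to its distance from a pivot,
# giving a sortable 1-D key.
pivot = data.mean(axis=0)
keys = np.linalg.norm(data - pivot, axis=1)
order = np.argsort(keys)
sorted_keys = keys[order]

# "Learned index": a linear model approximating each key's sorted position,
# plus the maximum observed error so lookups remain exact.
pos = np.arange(len(sorted_keys))
slope, intercept = np.polyfit(sorted_keys, pos, 1)
max_err = int(np.ceil(np.abs(slope * sorted_keys + intercept - pos).max()))

def lookup(record):
    """Exact search: predict the position, then scan only the error window."""
    k = np.linalg.norm(record - pivot)
    center = int(round(slope * k + intercept))
    lo = max(0, center - max_err)
    hi = min(len(sorted_keys), center + max_err + 1)
    # Verify candidates exactly within the window.
    for idx in order[lo:hi]:
        if np.allclose(data[idx], record):
            return int(idx)
    return -1
```

Because the window is bounded by the model's worst training residual, every stored record is guaranteed to fall inside its predicted window, so the search is exact despite the approximate model.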
arXiv Detail & Related papers (2022-04-21T11:24:55Z) - Improving Document Representations by Generating Pseudo Query Embeddings
for Dense Retrieval [11.465218502487959]
We design a method to mimic queries on each document via an iterative clustering process.
We also optimize the matching function with a two-step score calculation procedure.
Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results.
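The iterative clustering idea can be illustrated as follows: one document's token embeddings are clustered with plain k-means, and the centroids serve as pseudo-query embeddings; the scoring shown is a simplified max-pooling stand-in for the paper's learned two-step calculation, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
doc_tokens = rng.normal(size=(120, 24)).astype(np.float32)  # one document's token embeddings

def pseudo_query_embeddings(tokens, k=4, iters=10):
    """Cluster the token embeddings iteratively (plain k-means); each
    centroid acts as one pseudo-query embedding for the document."""
    centroids = tokens[rng.choice(len(tokens), k, replace=False)].copy()
    for _ in range(iters):
        dists = ((tokens[:, None] - centroids[None]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            mask = assign == c
            if mask.any():  # keep the old centroid if a cluster empties out
                centroids[c] = tokens[mask].mean(axis=0)
    return centroids

def doc_score(query_vec, centroids):
    """Step 1: match the query against each pseudo-query embedding.
    Step 2: pool the per-embedding scores into one document score
    (max-pooling here, as a stand-in for the learned procedure)."""
    return float((centroids @ query_vec).max())

pqs = pseudo_query_embeddings(doc_tokens)
score = doc_score(doc_tokens[0], pqs)
```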
arXiv Detail & Related papers (2021-05-08T05:28:24Z) - Overcomplete Deep Subspace Clustering Networks [80.16644725886968]
Experimental results on four benchmark datasets show the effectiveness of the proposed method over DSC and other clustering methods in terms of clustering error.
Our method is also not as dependent as DSC is on where pre-training should be stopped to get the best performance and is also more robust to noise.
arXiv Detail & Related papers (2020-11-16T22:07:18Z) - Progressively Pretrained Dense Corpus Index for Open-Domain Question
Answering [87.32442219333046]
We propose a simple and resource-efficient method to pretrain the paragraph encoder.
Our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
arXiv Detail & Related papers (2020-04-30T18:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.