Bridging the Gap Between Indexing and Retrieval for Differentiable
Search Index with Query Generation
- URL: http://arxiv.org/abs/2206.10128v3
- Date: Fri, 7 Jul 2023 04:08:33 GMT
- Title: Bridging the Gap Between Indexing and Retrieval for Differentiable
Search Index with Query Generation
- Authors: Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido
Zuccon, and Daxin Jiang
- Abstract summary: Differentiable Search Index (DSI) is an emerging paradigm for information retrieval.
We propose a simple yet effective indexing framework for DSI, called DSI-QG.
Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
- Score: 98.02743096197402
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The Differentiable Search Index (DSI) is an emerging paradigm for information
retrieval. Unlike traditional retrieval architectures where index and retrieval
are two different and separate components, DSI uses a single transformer model
to perform both indexing and retrieval.
In this paper, we identify and tackle an important issue of current DSI
models: the data distribution mismatch that occurs between the DSI indexing and
retrieval processes. Specifically, we argue that, at indexing, current DSI
methods learn to build connections between the text of long documents and the
identifier of the documents, but then retrieval of document identifiers is
based on queries that are commonly much shorter than the indexed documents.
This problem is further exacerbated when using DSI for cross-lingual retrieval,
where document text and query text are in different languages.
To address this fundamental problem of current DSI models, we propose a
simple yet effective indexing framework for DSI, called DSI-QG. When indexing,
DSI-QG represents documents with a number of potentially relevant queries
generated by a query generation model and re-ranked and filtered by a
cross-encoder ranker. The presence of these queries at indexing allows the DSI
models to connect a document identifier to a set of queries, hence mitigating
data distribution mismatches present between the indexing and the retrieval
phases. Empirical results on popular mono-lingual and cross-lingual passage
retrieval datasets show that DSI-QG significantly outperforms the original DSI
model.
Related papers
- De-DSI: Decentralised Differentiable Search Index [0.0]
De-DSI is a framework that fuses large language models with genuine decentralization for information retrieval.
It uses the differentiable search index (DSI) concept in a decentralized setting to efficiently connect novel user queries with document identifiers.
arXiv Detail & Related papers (2024-04-18T14:51:55Z) - LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
List K-kNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance.
There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results.
We develop a novel pseudolabel generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z) - Searching, fast and slow, through product catalogs [5.077235981745305]
We present a unified architecture for SKU search that provides both a real-time suggestion system and a lower latency search system.
We show how our system vastly outperforms, in all aspects, the results provided by the default search engine.
arXiv Detail & Related papers (2024-01-01T12:30:46Z) - Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z) - DSI++: Updating Transformer Memory with New Documents [95.70264288158766]
We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents.
We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents.
We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
arXiv Detail & Related papers (2022-12-19T18:59:34Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Transformer Memory as a Differentiable Search Index [102.41278496436948]
We introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids.
We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes.
arXiv Detail & Related papers (2022-02-14T19:12:43Z) - Improving Query Representations for Dense Retrieval with Pseudo
Relevance Feedback [29.719150565643965]
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval.
ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels.
Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
arXiv Detail & Related papers (2021-08-30T18:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.