DSI++: Updating Transformer Memory with New Documents
- URL: http://arxiv.org/abs/2212.09744v3
- Date: Fri, 8 Dec 2023 05:20:31 GMT
- Title: DSI++: Updating Transformer Memory with New Documents
- Authors: Sanket Vaibhav Mehta, Jai Gupta, Yi Tay, Mostafa Dehghani, Vinh Q.
Tran, Jinfeng Rao, Marc Najork, Emma Strubell, Donald Metzler
- Abstract summary: We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents.
We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents.
We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
- Score: 95.70264288158766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differentiable Search Indices (DSIs) encode a corpus of documents in model
parameters and use the same model to answer user queries directly. Despite the
strong performance of DSI models, deploying them in situations where the corpus
changes over time is computationally expensive because reindexing the corpus
requires re-training the model. In this work, we introduce DSI++, a continual
learning challenge for DSI to incrementally index new documents while being
able to answer queries related to both previously and newly indexed documents.
Across different model scales and document identifier representations, we show
that continual indexing of new documents leads to considerable forgetting of
previously indexed documents. We also hypothesize and verify that the model
experiences forgetting events during training, leading to unstable learning. To
mitigate these issues, we investigate two approaches. The first focuses on
modifying the training dynamics. Flatter minima implicitly alleviate
forgetting, so we optimize for flatter loss basins and show that the model
stably memorizes more documents ($+12\%$). Next, we introduce a generative
memory to sample pseudo-queries for documents and supplement them during
continual indexing to prevent forgetting for the retrieval task. Extensive
experiments on novel continual indexing benchmarks based on Natural Questions
(NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting
significantly. Concretely, it improves the average Hits@10 by $+21.1\%$ over
competitive baselines for NQ and requires $6$ times fewer model updates
compared to re-training the DSI model for incrementally indexing five corpora
in a sequence.
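
One way to read the "flatter loss basins" result is through a Sharpness-Aware Minimization (SAM) style update, which first perturbs the weights toward the locally sharpest nearby point and then uses the gradient at that perturbed point for the actual step. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions, not the paper's exact training recipe: `model`, `loss_fn`, `batch`, `optimizer`, and `rho` are hypothetical placeholders supplied by the caller.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """One SAM-style update aimed at flatter minima.

    loss_fn(model, batch) is a hypothetical closure returning the scalar
    indexing loss (e.g., the seq2seq loss mapping document text or queries
    to docid strings in a DSI-like setup).
    """
    # 1) Gradient at the current weights.
    loss = loss_fn(model, batch)
    loss.backward()

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)

    # 2) Climb to the locally "sharpest" nearby point: w <- w + rho * g / ||g||.
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = p.grad * scale
            p.add_(e)
            eps.append(e)

    optimizer.zero_grad()

    # 3) Gradient at the perturbed weights drives the actual update.
    loss_fn(model, batch).backward()

    # 4) Undo the perturbation, then step with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```
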
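The generative memory can be viewed as pseudo-query replay: a query-generation model samples queries for already-indexed documents, and the resulting (pseudo-query, docid) pairs are mixed into the batches used while continually indexing new documents. The sketch below assumes a doc2query-style T5 generator and an illustrative data layout; the checkpoint name, `replay_ratio`, and sampling settings are assumptions for illustration, not the paper's configuration.

```python
import random
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Hypothetical doc2query-style generator; "t5-base" is only a placeholder
# checkpoint, not the one used in the paper.
tok = T5TokenizerFast.from_pretrained("t5-base")
query_gen = T5ForConditionalGeneration.from_pretrained("t5-base")

def pseudo_queries(doc_text, n=2):
    """Sample n pseudo-queries for an already-indexed document."""
    inputs = tok(doc_text, return_tensors="pt", truncation=True, max_length=512)
    outs = query_gen.generate(
        **inputs, do_sample=True, top_k=10,
        num_return_sequences=n, max_new_tokens=32,
    )
    return [tok.decode(o, skip_special_tokens=True) for o in outs]

def continual_indexing_batch(new_docs, old_docs, replay_ratio=0.3):
    """Mix (document text -> docid) examples for newly arriving documents with
    (pseudo-query -> docid) replay examples for previously indexed ones.
    Each returned pair is a seq2seq training example for the DSI model."""
    batch = [(d["text"], d["docid"]) for d in new_docs]
    n_replay = int(replay_ratio * len(batch))
    for d in random.sample(old_docs, k=min(n_replay, len(old_docs))):
        for q in pseudo_queries(d["text"], n=1):
            batch.append((q, d["docid"]))
    return batch
```

In this framing, mixing even a modest fraction of replayed pseudo-query examples into each continual-indexing batch is what counters forgetting on the retrieval task.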
Related papers
- UpLIF: An Updatable Self-Tuning Learned Index Framework [4.077820670802213]
UpLIF is an adaptive self-tuning learned index that adjusts the model to accommodate incoming updates.
We also introduce the concept of balanced model adjustment, which determines the model's inherent properties.
arXiv Detail & Related papers (2024-08-07T22:30:43Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- IncDSI: Incrementally Updatable Document Retrieval [35.5697863674097]
IncDSI is a method to add documents in real time without retraining the model on the entire dataset.
We formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters.
Our approach is competitive with re-training the model on the whole dataset.
arXiv Detail & Related papers (2023-07-19T07:20:30Z)
- How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages, using a corpus of 8.8M passages and evaluating model sizes of up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
- Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation [98.02743096197402]
Differentiable Search Index (DSI) is an emerging paradigm for information retrieval.
We propose a simple yet effective indexing framework for DSI, called DSI-QG.
Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
arXiv Detail & Related papers (2022-06-21T06:21:23Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Transformer Memory as a Differentiable Search Index [102.41278496436948]
We introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids.
We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes.
arXiv Detail & Related papers (2022-02-14T19:12:43Z)