DSI++: Updating Transformer Memory with New Documents
- URL: http://arxiv.org/abs/2212.09744v3
- Date: Fri, 8 Dec 2023 05:20:31 GMT
- Title: DSI++: Updating Transformer Memory with New Documents
- Authors: Sanket Vaibhav Mehta, Jai Gupta, Yi Tay, Mostafa Dehghani, Vinh Q.
Tran, Jinfeng Rao, Marc Najork, Emma Strubell, Donald Metzler
- Abstract summary: We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents.
We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents.
We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
- Score: 95.70264288158766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differentiable Search Indices (DSIs) encode a corpus of documents in model
parameters and use the same model to answer user queries directly. Despite the
strong performance of DSI models, deploying them in situations where the corpus
changes over time is computationally expensive because reindexing the corpus
requires re-training the model. In this work, we introduce DSI++, a continual
learning challenge for DSI to incrementally index new documents while being
able to answer queries related to both previously and newly indexed documents.
Across different model scales and document identifier representations, we show
that continual indexing of new documents leads to considerable forgetting of
previously indexed documents. We also hypothesize and verify that the model
experiences forgetting events during training, leading to unstable learning. To
mitigate these issues, we investigate two approaches. The first focuses on
modifying the training dynamics. Flatter minima implicitly alleviate
forgetting, so we optimize for flatter loss basins and show that the model
stably memorizes more documents ($+12\%$). Next, we introduce a generative
memory to sample pseudo-queries for documents and supplement them during
continual indexing to prevent forgetting for the retrieval task. Extensive
experiments on novel continual indexing benchmarks based on Natural Questions
(NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting
significantly. Concretely, it improves the average Hits@10 by $+21.1\%$ over
competitive baselines for NQ and requires $6$ times fewer model updates
compared to re-training the DSI model for incrementally indexing five corpora
in a sequence.
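
One way to read the "flatter loss basins" result is through a Sharpness-Aware Minimization (SAM) style update, which first perturbs the weights toward the locally sharpest nearby point and then uses the gradient at that perturbed point for the actual step. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions, not the paper's exact training recipe: `model`, `loss_fn`, `batch`, `optimizer`, and `rho` are hypothetical placeholders supplied by the caller.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """One SAM-style update aimed at flatter minima.

    loss_fn(model, batch) is a hypothetical closure returning the scalar
    indexing loss (e.g., the seq2seq loss mapping document text or queries
    to docid strings in a DSI-like setup).
    """
    # 1) Gradient at the current weights.
    loss = loss_fn(model, batch)
    loss.backward()

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)

    # 2) Climb to the locally "sharpest" nearby point: w <- w + rho * g / ||g||.
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = p.grad * scale
            p.add_(e)
            eps.append(e)

    optimizer.zero_grad()

    # 3) Gradient at the perturbed weights drives the actual update.
    loss_fn(model, batch).backward()

    # 4) Undo the perturbation, then step with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```
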
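The generative memory can be viewed as pseudo-query replay: a query-generation model samples queries for already-indexed documents, and the resulting (pseudo-query, docid) pairs are mixed into the batches used while continually indexing new documents. The sketch below assumes a doc2query-style T5 generator and an illustrative data layout; the checkpoint name, `replay_ratio`, and sampling settings are assumptions for illustration, not the paper's configuration.

```python
import random
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Hypothetical doc2query-style generator; "t5-base" is only a placeholder
# checkpoint, not the one used in the paper.
tok = T5TokenizerFast.from_pretrained("t5-base")
query_gen = T5ForConditionalGeneration.from_pretrained("t5-base")

def pseudo_queries(doc_text, n=2):
    """Sample n pseudo-queries for an already-indexed document."""
    inputs = tok(doc_text, return_tensors="pt", truncation=True, max_length=512)
    outs = query_gen.generate(
        **inputs, do_sample=True, top_k=10,
        num_return_sequences=n, max_new_tokens=32,
    )
    return [tok.decode(o, skip_special_tokens=True) for o in outs]

def continual_indexing_batch(new_docs, old_docs, replay_ratio=0.3):
    """Mix (document text -> docid) examples for newly arriving documents with
    (pseudo-query -> docid) replay examples for previously indexed ones.
    Each returned pair is a seq2seq training example for the DSI model."""
    batch = [(d["text"], d["docid"]) for d in new_docs]
    n_replay = int(replay_ratio * len(batch))
    for d in random.sample(old_docs, k=min(n_replay, len(old_docs))):
        for q in pseudo_queries(d["text"], n=1):
            batch.append((q, d["docid"]))
    return batch
```

In this framing, mixing even a modest fraction of replayed pseudo-query examples into each continual-indexing batch is what counters forgetting on the retrieval task.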
Related papers
- UpLIF: An Updatable Self-Tuning Learned Index Framework [4.077820670802213]
UpLIF is an adaptive self-tuning learned index that adjusts the model to accommodate incoming updates.
We also introduce the concept of balanced model adjustment, which determines the model's inherent properties.
arXiv Detail & Related papers (2024-08-07T22:30:43Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- IncDSI: Incrementally Updatable Document Retrieval [35.5697863674097]
IncDSI is a method to add documents in real time without retraining the model on the entire dataset.
We formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters.
Our approach is competitive with re-training the model on the whole dataset.
arXiv Detail & Related papers (2023-07-19T07:20:30Z)
- How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages, using a corpus of 8.8M passages and evaluating model sizes of up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
- Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation [98.02743096197402]
Differentiable Search Index (DSI) is an emerging paradigm for information retrieval.
We propose a simple yet effective indexing framework for DSI, called DSI-QG.
Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
arXiv Detail & Related papers (2022-06-21T06:21:23Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Transformer Memory as a Differentiable Search Index [102.41278496436948]
We introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids.
We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes.
arXiv Detail & Related papers (2022-02-14T19:12:43Z)