Related papers: IncDSI: Incrementally Updatable Document Retrieval

URL: http://arxiv.org/abs/2307.10323v2
Date: Mon, 19 Aug 2024 07:02:19 GMT
Title: IncDSI: Incrementally Updatable Document Retrieval
Authors: Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
Abstract summary: IncDSI is a method to add documents in real time without retraining the model on the entire dataset. We formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Our approach is competitive with re-training the model on the whole dataset.
Score: 35.5697863674097
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks. However, they have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.

Related papers

MURR: Model Updating with Regularized Replay for Searching a Document Stream [32.0637790321157]
The Internet produces a continuous stream of new documents and user-generated queries. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly created content. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing.
arXiv Detail & Related papers (2025-04-14T14:13:03Z)
Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents [31.434507306952458]
We propose KNN-former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We also use combinatorial matching to address the one-to-one mapping property that exists in many documents. Our method is highly efficient compared to existing approaches in terms of the number of trainable parameters.
arXiv Detail & Related papers (2024-05-08T10:10:38Z)
PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. We propose PDFTriage, which enables models to retrieve context based on either structure or content. Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale, weakly labeled data from the web to benefit the training of Visually-rich Document Entity Retrieval (VDER) models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
DSI++: Updating Transformer Memory with New Documents [95.70264288158766]
We introduce DSI++, a continual learning challenge for DSI to incrementally index new documents. We show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task.
arXiv Detail & Related papers (2022-12-19T18:59:34Z)
DoSA: A System to Accelerate Annotations on Business Documents with Human-in-the-Loop [0.0]
DoSA (Document Specific Automated Annotations) helps annotators generate initial annotations automatically using our novel bootstrap approach. An open-source, ready-to-use implementation is made available on GitHub.
arXiv Detail & Related papers (2022-11-09T15:04:07Z)
Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions. Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level. We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task. To find semantic relations between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)