Replication and Exploration of Generative Retrieval over Dynamic Corpora
- URL: http://arxiv.org/abs/2504.17519v1
- Date: Thu, 24 Apr 2025 13:01:23 GMT
- Title: Replication and Exploration of Generative Retrieval over Dynamic Corpora
- Authors: Zhen Zhang, Xinyu Ma, Weiwei Sun, Pengjie Ren, Zhumin Chen, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren
- Abstract summary: Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). We show that existing GR models with text-based docids show superior generalization to unseen documents. We propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). However, most existing GR models are developed and evaluated on a static document collection, and their performance in dynamic corpora, where document collections evolve continuously, is rarely studied. In this paper, we first reproduce and systematically evaluate various representative GR approaches over dynamic corpora. Through extensive experiments, we reveal that existing GR models with *text-based* docids show superior generalization to unseen documents. We observe that the more fine-grained the docid design in the GR model, the better its performance over dynamic corpora, surpassing BM25 and even being comparable to dense retrieval methods. While GR models with *numeric-based* docids show high efficiency, their performance drops significantly over dynamic corpora. Furthermore, our experiments show that the underperformance of numeric-based docids is partly due to their excessive tendency toward the initial document set, which likely results from overfitting on the training set. We then conduct an in-depth analysis of the best-performing GR methods. We identify three critical advantages of text-based docids in dynamic corpora: (i) semantic alignment with language models' pretrained knowledge, (ii) fine-grained docid design, and (iii) high lexical diversity. Building on these insights, we finally propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids, achieving improved performance on dynamic corpora without requiring additional retraining. Our work offers empirical evidence for advancing GR methods over dynamic corpora and paves the way for developing more generalized yet efficient GR models in real-world search engines.
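To make the closing idea concrete, here is a minimal, hypothetical sketch of retrieval over two docid spaces at once. Everything in it is invented for illustration (the toy corpus, the word-level text docids, and a keyword matcher standing in for the constrained generative decoder); it is not the authors' implementation.

```python
# Toy multi-docid retrieval: each document is reachable via a numeric docid
# and via several text-based docids (here, its individual words), and the
# candidates from both docid spaces are merged into one ranked list.
from collections import defaultdict

docs = {
    0: "generative retrieval over dynamic corpora",
    1: "dense retrieval with dual encoders",
    2: "bm25 baselines for ad hoc search",
}

# Numeric docid space: one opaque identifier per document (cheap to decode,
# but it only fires on an exact match).
numeric_index = {str(i): i for i in docs}

# Text docid space: fine-grained, lexically diverse identifiers that can
# also match documents added after training.
text_index = defaultdict(set)
for i, text in docs.items():
    for token in text.split():
        text_index[token].add(i)

def retrieve(query: str, k: int = 2):
    """Merge evidence from the numeric and text docid spaces.

    A real GR model would *generate* docids with constrained decoding;
    exact token overlap stands in for generation here, which is enough to
    show how the two docid spaces combine into a single ranking.
    """
    scores = defaultdict(float)
    for token in query.split():
        for doc_id in text_index.get(token, ()):
            scores[doc_id] += 1.0
        if token in numeric_index:
            scores[numeric_index[token]] += 1.0
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(retrieve("dynamic generative retrieval"))  # [(0, 3.0), (1, 1.0)]
```

The design point the sketch mirrors is that the two docid spaces fail differently: numeric docids are compact but tied to the indexed set, while text docids overlap lexically with unseen documents, so merging them hedges against corpus drift.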
Related papers
- Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models
The paper proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge scheduling efficiency.
The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek.
The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion.
arXiv Detail & Related papers (2025-04-28T02:50:45Z) - SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs
We propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE). We first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. We then introduce a Generative Semantic Verification (GSV) strategy that enables fine-grained target discrimination.
arXiv Detail & Related papers (2025-04-17T17:59:27Z) - Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization
The research aims to improve retrieval and generation accuracy by introducing Persian-specific models. Three datasets were used to assess these models: general knowledge (PQuad), scientifically specialized texts, and organizational reports. MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets.
arXiv Detail & Related papers (2025-01-08T22:16:40Z) - Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling
Large language models (LLMs) are prone to hallucination and to producing factually incorrect information.
We propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search.
arXiv Detail & Related papers (2024-12-19T13:55:48Z) - Loops On Retrieval Augmented Generation (LoRAG)
Loops On Retrieval Augmented Generation (LoRAG) is a new framework designed to enhance the quality of retrieval-augmented text generation.
The architecture integrates a generative model, a retrieval mechanism, and a dynamic loop module.
LoRAG surpasses existing state-of-the-art models in terms of BLEU score, ROUGE score, and perplexity.
arXiv Detail & Related papers (2024-03-18T15:19:17Z) - Assessing generalization capability of text ranking models in Polish
Retrieval-augmented generation (RAG) is becoming an increasingly popular technique for integrating internal knowledge bases with large language models.
In this article, we focus on the reranking problem for the Polish language, examining the performance of rerankers.
The best of our models establishes a new state-of-the-art for reranking in the Polish language, outperforming existing models with up to 30 times more parameters.
arXiv Detail & Related papers (2024-02-22T06:21:41Z) - CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks
Retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy.
Traditional retrieval modules often rely on a large document index and are disconnected from generative tasks.
We propose CorpusLM, a unified language model that integrates generative retrieval, closed-book generation, and RAG.
arXiv Detail & Related papers (2024-02-02T06:44:22Z) - Contextualization Distillation from Large Language Model for Knowledge Graph Completion
We introduce the Contextualization Distillation strategy, a plug-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
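As a rough, hypothetical illustration of that triplet-to-context step (the prompt wording and the `call_llm` stub are invented here, not taken from the paper):

```python
# Sketch: verbalize a KG triplet into an instruction for an LLM, whose
# output then serves as auxiliary context for a downstream KGC model.
def triplet_to_prompt(head: str, relation: str, tail: str) -> str:
    return (
        f"Rewrite the knowledge-graph triplet ({head}, {relation}, {tail}) "
        "as a short, context-rich paragraph with relevant background facts."
    )

def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API call; canned output here.
    return f"[LLM output for: {prompt!r}]"

context = call_llm(triplet_to_prompt("Marie Curie", "award", "Nobel Prize"))
print(context)  # the context-rich segment fed to the KGC model
```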
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - IRGen: Generative Modeling for Image Retrieval
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z) - Autoregressive Search Engines: Generating Substrings as Document Identifiers
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not impose any structure on the search space: using all n-grams in a passage as its possible identifiers (a toy version is sketched below).
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
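Here is that toy rendering of the n-grams-as-identifiers idea. SEAL itself constrains decoding with an FM-index over the corpus; the plain dictionary below (with invented names) is a simplified stand-in for that structure.

```python
# Every word n-gram occurring in a passage is treated as a valid
# identifier for it, instead of assigning one structured docid per passage.
from collections import defaultdict

passages = {
    "p1": "autoregressive language models generate answers token by token",
    "p2": "generative retrieval maps queries to document identifiers",
}

def ngrams(tokens, n):
    # All contiguous word n-grams of length n.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Index every 1- to 3-gram of every passage as a possible identifier.
ngram_index = defaultdict(set)
for pid, text in passages.items():
    tokens = text.split()
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            ngram_index[gram].add(pid)

# A constrained autoregressive decoder would generate an n-gram guaranteed
# to occur in the corpus; a fixed string stands in for it here.
generated = "document identifiers"
print(ngram_index[generated])  # {'p2'}
```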