Revela: Dense Retriever Learning via Language Modeling
- URL: http://arxiv.org/abs/2506.16552v1
- Date: Thu, 19 Jun 2025 19:13:59 GMT
- Title: Revela: Dense Retriever Learning via Language Modeling
- Authors: Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl,
- Abstract summary: We introduce Revela, a unified training framework for self-supervised retriever learning via language modeling.<n>We evaluate Revela on both general-domain (BEIR) and domain-specific (CoIR) benchmarks across various retriever backbones.
- Score: 85.12131321155486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly and hard to obtain in specialized domains such as code-motivating growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next-token prediction on both local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on both general-domain (BEIR) and domain-specific (CoIR) benchmarks across various retriever backbones. At a comparable parameter scale, Revela outperforms the previous best method with absolute improvements of 5.2 % (18.3 % relative) and 5.6 % (14.4 % relative) on NDCG@10, respectively, underscoring its effectiveness. Performance increases with model size, highlighting both the scalability of our approach and its promise for self-supervised retriever learning.
Related papers
- Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step.<n>We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
arXiv Detail & Related papers (2025-05-27T03:47:33Z) - Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models [51.608246558235166]
SCARLet is a framework for training utility-based retrievers in RALMs.<n>It incorporates two key factors, multi-task generalization and inter-passage interaction.<n>We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain.
arXiv Detail & Related papers (2025-04-01T09:28:28Z) - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs.<n>Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling [42.67141329779589]
Grouped Cross Attention can generalize to 1000 times the pre-training context length.<n>Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths.
arXiv Detail & Related papers (2024-10-02T15:18:34Z) - Learning to Retrieve Iteratively for In-Context Learning [56.40100968649039]
iterative retrieval is a novel framework that empowers retrievers to make iterative decisions through policy optimization.
We instantiate an iterative retriever for composing in-context learning exemplars and apply it to various semantic parsing tasks.
By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever.
arXiv Detail & Related papers (2024-06-20T21:07:55Z) - Generate to Understand for Representation [3.5325087487696463]
GUR is a pretraining framework that combines language modeling and contrastive learning objectives in a single training step.
GUR achieves impressive results without any labeled training data, outperforming all other pretrained baselines as a retriever at the recall benchmark in a zero-shot setting.
arXiv Detail & Related papers (2023-06-14T06:00:18Z) - BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer [1.911678487931003]
Retrieval-based language models are increasingly employed in question-answering tasks.
We develop the first Norwegian retrieval-based model by adapting the REALM framework.
We show that this type of training improves the reader's performance on extractive question-answering.
arXiv Detail & Related papers (2023-04-19T13:40:47Z) - Can Retriever-Augmented Language Models Reason? The Blame Game Between
the Retriever and the Language Model [33.729248437727634]
Augmenting pretrained language models with retrievers has shown promise in effectively solving common NLP problems.
We evaluate the strengths and weaknesses of popular retriever-augmented language models, namely kNN-LM, REALM, DPR + FiD, Contriever + ATLAS, and Contriever + Flan-T5.
arXiv Detail & Related papers (2022-12-18T19:27:41Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Leveraging Advantages of Interactive and Non-Interactive Models for
Vector-Based Cross-Lingual Information Retrieval [12.514666775853598]
We propose a novel framework to leverage the advantages of interactive and non-interactive models.
We introduce semi-interactive mechanism, which builds our model upon non-interactive architecture but encodes each document together with its associated multilingual queries.
Our methods significantly boost the retrieval accuracy while maintaining the computational efficiency.
arXiv Detail & Related papers (2021-11-03T03:03:19Z) - How Context Affects Language Models' Factual Predictions [134.29166998377187]
We integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way.
We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.
arXiv Detail & Related papers (2020-05-10T09:28:12Z) - REALM: Retrieval-Augmented Language Model Pre-Training [37.3178586179607]
We augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia.
For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner.
We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA)
arXiv Detail & Related papers (2020-02-10T18:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.