Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
- URL: http://arxiv.org/abs/2504.21015v2
- Date: Tue, 21 Oct 2025 06:22:08 GMT
- Title: Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
- Authors: Aarush Sinha
- Abstract summary: Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora. We propose an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains including mathematics, physics, computer science, and related fields.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders (CE), which require full corpus access. We propose a corpus-free alternative: an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains including mathematics, physics, computer science, and related fields, serving as positive passages for query generation. We evaluate two fine-tuning configurations of DistilBERT for dense retrieval: one using LLM-generated hard negatives conditioned solely on the query, and another using negatives generated with both the query and its positive document as context. Compared to traditional corpus-based mining methods (LLM Query $\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN) on multiple BEIR benchmark datasets, our all-LLM pipeline outperforms strong lexical mining baselines and achieves performance comparable to cross-encoder-based methods, demonstrating the potential of corpus-free hard negative generation for retrieval model training.
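The two-step generation scheme in the abstract lends itself to a short sketch. The following is a minimal illustration, not the authors' released code: `call_llm` stands in for any LLM completion API, and the prompt templates, the `TrainingTriplet` container, and the `use_positive_context` flag are assumptions made for readability.

```python
# Hypothetical sketch of the corpus-free hard-negative pipeline described in the abstract.
# `call_llm` is a placeholder for any LLM completion API; prompts are illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingTriplet:
    query: str      # LLM-generated query
    positive: str   # source passage (e.g., an arXiv abstract)
    negative: str   # LLM-generated hard negative


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError


def generate_query(passage: str) -> str:
    # Step 1: generate a search query that the passage answers.
    prompt = (
        "Write a short search query that the following passage answers.\n\n"
        f"Passage: {passage}\n\nQuery:"
    )
    return call_llm(prompt).strip()


def generate_hard_negative(query: str, positive: Optional[str] = None) -> str:
    # Step 2: generate a passage that looks relevant to the query but does not answer it.
    # The paper compares query-only conditioning with query + positive-document conditioning.
    context = f"\nPositive passage for reference:\n{positive}\n" if positive else ""
    prompt = (
        "Write a passage that is topically related to the query below but does NOT "
        f"answer it.{context}\nQuery: {query}\n\nPassage:"
    )
    return call_llm(prompt).strip()


def build_triplet(passage: str, use_positive_context: bool = False) -> TrainingTriplet:
    query = generate_query(passage)
    negative = generate_hard_negative(query, passage if use_positive_context else None)
    return TrainingTriplet(query=query, positive=passage, negative=negative)
```

The resulting (query, positive, negative) triplets can then be used with a standard triplet or contrastive loss to fine-tune a DistilBERT-based dense retriever, for example via the sentence-transformers library; no document corpus is consulted during the negative-mining step.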
Related papers
- Adaptation of Embedding Models to Financial Filings via LLM Distillation [10.744318713371383]
This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general-purpose retrieval embedding model as a foundation. The method yields an average improvement of 27.7% in MRR@5 and 44.6% in mean DCG@5 across 14 financial filing types, measured over 21,800 query-document pairs.
arXiv Detail & Related papers (2025-12-08T22:43:14Z) - SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation [30.096211889103998]
We introduce Step COmpression for Process Estimation (SCOPE), a novel compression-based approach that significantly reduces annotation costs. We construct a large-scale dataset containing 196K samples with only 5% of the computational resources required by previous methods. Empirical results demonstrate that PRMs trained on our dataset consistently outperform existing automated annotation approaches on both the Best-of-N strategy and ProcessBench.
arXiv Detail & Related papers (2025-05-20T14:31:15Z) - Optimizing Retrieval Augmented Generation for Object Constraint Language [3.4777703321218225]
OCL is essential for Model-Based Systems Engineering (MBSE), but manually writing OCL rules is complex and time-consuming. We evaluate the impact of three different retrieval strategies on OCL generation. We show that while retrieval can enhance generation accuracy, its effectiveness depends on the retrieval method and the number of retrieved chunks.
arXiv Detail & Related papers (2025-05-19T14:00:10Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - Let your LLM generate a few tokens and you will reduce the need for retrieval [1.0878040851638]
Large language models (LLMs) can be trained to check whether an answer is already stored in their parametric memory. We distill an LLM-as-a-judge to compute the IK (I Know) score.
arXiv Detail & Related papers (2024-12-16T08:13:14Z) - Data Fusion of Synthetic Query Variants With Generative Large Language Models [1.864807003137943]
This work explores the feasibility of using synthetic query variants generated by instruction-tuned Large Language Models in data fusion experiments.
We introduce a lightweight, unsupervised, and cost-efficient approach that exploits principled prompting and data fusion techniques.
Our analysis shows that data fusion based on synthetic query variants is significantly better than baselines with single queries and also outperforms pseudo-relevance feedback methods.
arXiv Detail & Related papers (2024-11-06T12:54:27Z) - Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism [2.919891871101241]
Transformers have a quadratic scaling of computational complexity with input size.
Retrieval-augmented generation (RAG) can better handle longer contexts by using a retrieval system.
We introduce a novel approach, Inner Loop Memory Augmented Tree Retrieval (ILM-TR)
arXiv Detail & Related papers (2024-10-11T19:49:05Z) - Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Traceable LLM-based validation of statements in knowledge graphs [0.0]
This article presents a method for verifying RDF triples using LLMs. Because LLMs cannot currently reliably identify the origin of the information used to construct the response to the user prompt, our approach avoids using internal LLM factual knowledge altogether. Instead, verified RDF statements are compared to chunks of external documents retrieved through a web search or Wikipedia.
arXiv Detail & Related papers (2024-09-11T12:27:41Z) - Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Neighbor Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z) - PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z) - SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z) - A Monte Carlo Language Model Pipeline for Zero-Shot Sociopolitical Event Extraction [4.818309069556584]
Event extraction could allow researchers to flexibly specify arbitrary event classes for new research questions.
Current zero-shot EE methods, as well as a naive zero-shot approach of simple generative language model (LM) prompting, perform poorly for dyadic event extraction.
We address these challenges with a new fine-grained, multi-stage instruction-following generative LM pipeline.
We demonstrate our pipeline's application to dyadic international relations analysis.
arXiv Detail & Related papers (2023-05-24T11:41:33Z) - Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves performance when used in standard nearest neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z) - Large Language Models are Strong Zero-Shot Retriever [89.16756291653371]
We propose a simple method that applies a large language model (LLM) to large-scale retrieval in zero-shot scenarios.
Our method, the Language Model as Retriever (LameR), is built upon no other neural models but an LLM.
arXiv Detail & Related papers (2023-04-27T14:45:55Z) - Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs) and expands the query with them (a minimal sketch of this style of expansion appears after the related-papers list).
query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets.
Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data [82.92758444543689]
Retrieval-based methods have been shown to be effective in NLP tasks via introducing external knowledge.
Surprisingly, we found that REtrieving from the traINing datA (REINA) alone can lead to significant gains on multiple NLG and NLU tasks.
Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks.
arXiv Detail & Related papers (2022-03-16T17:37:27Z)
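Several of the related papers above (query2doc most directly) expand a query with LLM-generated pseudo-documents before lexical retrieval. The following is a minimal sketch of that idea under stated assumptions: `call_llm` is again a placeholder, the prompt and the query-repetition factor are illustrative rather than the original method's settings, and BM25 scoring uses the rank_bm25 package.

```python
# Minimal query-expansion sketch in the spirit of query2doc (not the authors' code).
from rank_bm25 import BM25Okapi


def call_llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError


def expand_query(query: str, repeat: int = 5) -> str:
    # Ask the LLM for a pseudo-document that plausibly answers the query.
    pseudo_doc = call_llm(f"Write a short passage that answers this query: {query}")
    # Repeating the original query keeps its terms from being drowned out
    # by the much longer pseudo-document.
    return " ".join([query] * repeat + [pseudo_doc])


def bm25_search(corpus: list[str], query: str, k: int = 10) -> list[int]:
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(expand_query(query).lower().split())
    # Indices of the top-k documents by BM25 score over the expanded query.
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
```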
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.