InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems
- URL: http://arxiv.org/abs/2508.13930v1
- Date: Tue, 19 Aug 2025 15:23:18 GMT
- Title: InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems
- Authors: Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei,
- Abstract summary: This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark. We introduce two key extensions to the pipeline: fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts.
- Score: 3.09578981466695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval performance. All code, models, and synthetic datasets are publicly released to support further research at: https://github.com/danilotpnta/IR2-project
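The filtering step that the abstract refers to can be pictured with a minimal sketch. It assumes each synthetic query-document pair already carries a relevance score (for example from a reranker) and keeps only the k highest-scoring pairs. The `SyntheticPair` class, the `keep_top_k` function, and the example scores are hypothetical illustrations, not part of the InPars Toolkit or the authors' released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticPair:
    doc_id: str
    query: str
    score: float  # hypothetical relevance score, e.g. assigned by a reranker


def keep_top_k(pairs, k):
    """Return the k highest-scoring pairs, best first (illustrative filter)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return sorted(pairs, key=lambda p: p.score, reverse=True)[:k]


# Made-up example data; the scores are not results from the paper.
pairs = [
    SyntheticPair("d1", "does vitamin D reduce fracture risk?", 0.91),
    SyntheticPair("d2", "what is a cell?", 0.12),
    SyntheticPair("d3", "effect of statins on LDL cholesterol", 0.78),
    SyntheticPair("d4", "study results", 0.05),
]

filtered = keep_top_k(pairs, k=2)
print([p.doc_id for p in filtered])  # prints ['d1', 'd3']
```

In this picture, a less aggressive filter simply corresponds to a larger k relative to the number of generated pairs.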
Related papers
- RouteRAG: Efficient Retrieval-Augmented Generation from Text and Graph via Reinforcement Learning [69.87510139069218]
Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs).
Recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL).
We introduce RouteRAG, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG.
arXiv Detail & Related papers (2025-12-10T10:05:31Z) - RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
RAPO++ is a cross-stage prompt optimization framework.
It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning.
RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z) - Rethinking On-policy Optimization for Query Augmentation [49.87723664806526]
We present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks.
We introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which learns to generate a pseudo-document that maximizes retrieval performance.
arXiv Detail & Related papers (2025-10-20T04:16:28Z) - Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4410890572479]
We introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification.
LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains.
LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples.
arXiv Detail & Related papers (2025-09-03T06:42:40Z) - When Retriever Meets Generator: A Joint Model for Code Comment Generation [3.6781644685120924]
RAGSum fuses retrieval and generation using a single CodeT5 backbone.
A contrastive pre-training phase shapes code embeddings for nearest-neighbor search.
A lightweight self-refinement loop is deployed to polish the final output.
arXiv Detail & Related papers (2025-07-16T18:12:27Z) - Ext2Gen: Alignment through Unified Extraction and Generation for Robust Retrieval-Augmented Generation [18.570899885235104]
We propose Ext2Gen, a novel extract-then-generate model that enhances RAG by extracting query-relevant sentences before generating answers.
Experiments demonstrate that Ext2Gen effectively identifies query-relevant sentences with high precision and recall, leading to highly reliable answers.
arXiv Detail & Related papers (2025-02-28T06:46:53Z) - BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge.
This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning.
Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers? [72.42500059688396]
We show that the generalization of a strong neural ranker can be improved by prompt engineering and by aggregating the ranking results of each expanded query via fusion.
Experiments on BEIR and TREC Deep Learning show that the nDCG@10 scores of both MonoT5 and RankT5 improve after these steps.
arXiv Detail & Related papers (2023-11-15T18:11:41Z) - Prompt Generate Train (PGT): Few-shot Domain Adaption of Retrieval Augmented Generation Models for Open Book Question-Answering [0.0]
We propose a framework to efficiently develop a generative question-answering model for open-book question-answering over a proprietary collection of text documents.
The framework adapts a retriever augmented generation (RAG) model to the target domain using supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2023-07-12T04:44:31Z) - InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval [4.888022358881737]
We introduce InPars-v2, a dataset generator that uses open-source LLMs and powerful rerankers to select synthetic query-document pairs for training.
A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark.
arXiv Detail & Related papers (2023-01-04T20:58:43Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z) - Neural Pipeline for Zero-Shot Data-to-Text Generation [3.42658286826597]
We propose to generate text by transforming single-item descriptions with a sequence of modules trained on general-domain text-based operations.
Our experiments on two major triple-to-text datasets -- WebNLG and E2E -- show that our approach enables D2T generation from RDF triples in zero-shot settings.
arXiv Detail & Related papers (2022-03-30T13:14:35Z) - Recent Developments Combining Ensemble Smoother and Deep Generative Networks for Facies History Matching [58.720142291102135]
This research project focuses on the use of autoencoder networks to construct a continuous parameterization for facies models.
We benchmark seven different formulations, including VAE, generative adversarial network (GAN), Wasserstein GAN, variational auto-encoding GAN, principal component analysis (PCA) with cycle GAN, PCA with transfer style network and VAE with style loss.
arXiv Detail & Related papers (2020-05-08T21:32:42Z)