Related papers: ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research

ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research

URL: http://arxiv.org/abs/2506.11117v1
Date: Mon, 09 Jun 2025 11:47:13 GMT
Title: ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research
Authors: Junyong Lin, Lu Dai, Ruiqian Han, Yijie Sui, Ruilin Wang, Xingliang Sun, Qinglin Wu, Min Feng, Hao Liu, Hui Xiong,
Abstract summary: We develop ScIRGen, a dataset generation framework for scientific QA & retrieval.<n>We use it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers.
Score: 15.983924435685553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA \& retrieval that more accurately reflects the information needs of professional science researchers, and uses it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework by employing cognitive taxonomy to ensure the quality of synthesized questions. We also design a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgment of answers' validity. Collectively, these methodologies culminated in the creation of the 61k QA dataset, ScIRGen-Geo. We benchmarked representative methods on the ScIRGen-Geo dataset for their question-answering and retrieval capabilities, finding out that current methods still suffer from reasoning from complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.

Related papers

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR [64.22412492998754]
We release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA.<n>We train search agents in this environment to outperform non-RL retrieval baselines.<n>Our data creation methods are scalable and easily extendable to other scientific domains.
arXiv Detail & Related papers (2026-01-26T06:46:16Z)
ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery [6.780086370528623]
We introduce textbfReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery.<n>ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture.<n>Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods.
arXiv Detail & Related papers (2026-01-20T17:27:12Z)
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research.<n>This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate.<n>We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research.<n>We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn.<n>We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
A GenAI System for Improved FAIR Independent Biological Database Integration [0.0]
We introduce an experimental natural language-based query processing system designed to empower scientists to discover, access, and query biological databases.<n> FAIRBridge harnesses the capabilities of AI to interpret query intents, map them to relevant databases, and generate executable queries.<n>The system also includes robust tools for mitigating low-quality query processing, ensuring high fidelity and responsiveness in the information delivered.
arXiv Detail & Related papers (2025-06-22T08:04:24Z)
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation [63.55258191625131]
InfoDeepSeek is a new benchmark for assessing agentic information seeking in real-world, dynamic web environments.<n>We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity.<n>We develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes.
arXiv Detail & Related papers (2025-05-21T14:44:40Z)
CS-PaperSum: A Large-Scale Dataset of AI-Generated Summaries for Scientific Papers [3.929864777332447]
CS-PaperSum is a large-scale dataset of 91,919 papers from 31 top-tier computer science conferences.<n>Our dataset enables automated literature analysis, research trend forecasting, and AI-driven scientific discovery.
arXiv Detail & Related papers (2025-02-27T22:48:35Z)
iTRI-QA: a Toolset for Customized Question-Answer Dataset Generation Using Language Models for Enhanced Scientific Research [1.2411445143550854]
We present a tool for the development of a customized question-answer (QA) dataset, called Interactive Trained Research Innovator (iTRI) - QA.<n>Our approach integrates curated QA datasets with a specialized research paper dataset to enhance responses' contextual relevance and accuracy using fine-tuned LM.<n>This pipeline provides a dynamic and domain-specific QA system that will be applied for future research LM deployment.
arXiv Detail & Related papers (2025-01-27T23:38:39Z)
Synthetic Data Generation with Large Language Models for Personalized Community Question Answering [47.300506002171275]
We build Sy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Our findings suggest that LLMs have high potential in generating data tailored to users' needs. The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.
arXiv Detail & Related papers (2024-10-29T16:19:08Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs) We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT. To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z)
DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications [3.463438487417909]
This study proposes comprehensive guidelines for generating, annotating, and validating synthetic datasets. The study underscores the importance of thoughtful sampling methods to ensure the appropriate size, diversity, utility, and realism of a dataset. Overall, this paper offers valuable insights for researchers intending to create and publish synthetic datasets for engineering design.
arXiv Detail & Related papers (2023-05-15T21:00:09Z)
Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature. We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.