PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR
- URL: http://arxiv.org/abs/2601.18207v1
- Date: Mon, 26 Jan 2026 06:46:16 GMT
- Title: PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR
- Authors: James Burgess, Jan N. Hansen, Duo Peng, Yuhui Zhang, Alejandro Lozano, Min Woo Sun, Emma Lundberg, Serena Yeung-Levy
- Abstract summary: We release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA. We train search agents in this environment to outperform non-RL retrieval baselines. Our data creation methods are scalable and easily extendable to other scientific domains.
- Score: 64.22412492998754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers -- this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training and released on https://huggingface.co/collections/jmhb/papersearchqa. Finally, our data creation methods are scalable and easily extendable to other scientific domains.
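To make the RLVR setup concrete: under verifiable rewards, the agent's final answer is scored automatically against a gold answer, with no learned reward model. A minimal sketch of such a reward for factoid QA is below; the normalization rules and function names are illustrative assumptions, not the paper's actual implementation.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def verifiable_reward(predicted: str, gold: str) -> float:
    """Return 1.0 iff the normalized prediction exactly matches the gold answer."""
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0

# The reward ignores case, articles, and punctuation, but is otherwise strict:
print(verifiable_reward("The BRCA1 gene", "brca1 gene"))  # 1.0
print(verifiable_reward("BRCA2", "BRCA1"))                # 0.0
```

Because the reward is a deterministic function of the final answer, intermediate search and reasoning steps are supervised only indirectly, which is what allows behaviors like planning and self-verification to emerge during training.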
Related papers
- AgentIR: Reasoning-Aware Retrieval for Deep Research Agents [76.29382561831105]
Deep Research agents generate explicit natural language reasoning before each search call. Reasoning-Aware Retrieval embeds the agent's reasoning trace alongside its query. DR-Synth generates Deep Research retriever training data from standard QA datasets. AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch.
arXiv Detail & Related papers (2026-03-04T18:47:26Z) - Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text [90.21130870640057]
Golden Goose is a trick to synthesize unlimited RLVR tasks from unverifiable internet text. We leverage reasoning-rich unverifiable corpora to synthesize GooseReason-0.7M, a large-scale RLVR dataset. We also deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain.
arXiv Detail & Related papers (2026-01-30T13:39:11Z) - ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers? [29.17900668495058]
We introduce ReplicationBench, an evaluation framework for frontier AI agents. It tests whether agents can replicate entire research papers drawn from the astrophysics literature. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks.
arXiv Detail & Related papers (2025-10-28T16:21:19Z) - AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research. Our suite comes with the first scientific research environment with production-grade search tools. Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z) - SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents [93.26456498576181]
This paper focuses on the development of native Autonomous Single-Agent models for Deep Research. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark.
arXiv Detail & Related papers (2025-09-08T02:07:09Z) - From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z) - "Turing Tests" For An AI Scientist [0.0]
This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently.
We propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
arXiv Detail & Related papers (2024-05-22T05:14:27Z) - PaperQA: Retrieval-Augmented Generative Agent for Scientific Research [41.9628176602676]
We present PaperQA, a RAG agent for answering questions over the scientific literature.
PaperQA is an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers.
We also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature.
arXiv Detail & Related papers (2023-12-08T18:50:20Z) - Around the GLOBE: Numerical Aggregation Question-Answering on Heterogeneous Genealogical Knowledge Graphs with Deep Neural Networks [0.934612743192798]
We present a new end-to-end methodology for numerical aggregation QA for genealogical trees.
The proposed architecture, GLOBE, outperforms the state-of-the-art models and pipelines by achieving 87% accuracy for this task.
This study may have practical implications for genealogical information centers and museums.
arXiv Detail & Related papers (2023-07-30T12:09:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.