ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval
- URL: http://arxiv.org/abs/2601.21654v1
- Date: Thu, 29 Jan 2026 12:51:44 GMT
- Title: ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval
- Authors: Hao Shen, Hang Yang, Zhouhong Gu,
- Abstract summary: We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth.
- Score: 11.41528830724814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.
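As a concrete illustration of the decoupled design described above, here is a minimal sketch of a deterministic retrieval environment with the three stages (query planning, tool invocation, relevance assessment) factored apart. All class, function, and parameter names here are hypothetical, not ScholarGym's actual API; the lexical scorer is a toy stand-in for a fixed retrieval index.

```python
# Hypothetical sketch of a deterministic retrieval environment with the
# three decoupled stages from the abstract; names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Paper:
    paper_id: str
    title: str
    abstract: str


@dataclass
class SimulatedSearchEnv:
    """Deterministic retrieval over a frozen corpus: the same query always
    returns the same ranked results, making runs reproducible."""
    corpus: dict[str, Paper] = field(default_factory=dict)

    def search(self, query: str, k: int = 10) -> list[Paper]:
        # Toy lexical scorer; a real environment would query a fixed index.
        terms = set(query.lower().split())
        scored = [
            (sum(t in (p.title + " " + p.abstract).lower() for t in terms),
             p.paper_id, p)
            for p in self.corpus.values()
        ]
        # Tie-break on paper_id so the ranking is fully deterministic.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [p for _, _, p in scored[:k]]


def research_loop(env, question, plan_queries, judge_relevance, rounds=3):
    """Iterative deep-research loop over the three decoupled stages."""
    selected: list[Paper] = []
    for _ in range(rounds):
        for q in plan_queries(question, selected):       # 1. query planning
            for p in env.search(q):                      # 2. tool invocation
                if p not in selected and judge_relevance(question, p):
                    selected.append(p)                   # 3. relevance assessment
    return selected
```

Because `plan_queries` and `judge_relevance` are injected as callables, each stage can be swapped or ablated independently, which is the kind of fine-grained, per-stage analysis the abstract describes.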
Related papers
- AnalyticsGPT: An LLM Workflow for Scientometric Question Answering [1.5658704610960574]
AnalyticsGPT is an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering.
arXiv Detail & Related papers (2026-02-10T14:23:55Z)
- MADE: Benchmark Environments for Closed-Loop Materials Discovery [26.575933886112157]
We introduce MAterials Discovery Environments (MADE), a novel framework for benchmarking end-to-end autonomous materials discovery pipelines. We formalize discovery as a search for thermodynamically stable compounds relative to a given convex hull, and evaluate efficacy and efficiency via comparison to baseline algorithms. We demonstrate this by conducting systematic experiments across a family of systems, enabling ablation of components in discovery pipelines, and comparison of how methods scale with system complexity.
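The hull-based stability criterion can be made concrete with a small numeric sketch: a compound is stable when its formation energy lies on the lower convex hull of (composition, energy) points. The helper names and the binary-system example below are assumptions for illustration, not MADE's implementation.

```python
# Sketch: energy above the convex hull for a hypothetical binary A-B system.
import numpy as np


def lower_hull(points: np.ndarray) -> np.ndarray:
    """Lower convex hull of 2D points (Andrew's monotone chain, lower half)."""
    pts = points[np.argsort(points[:, 0])]
    hull: list[np.ndarray] = []
    for p in pts:
        while len(hull) >= 2:
            a, b = hull[-1] - hull[-2], p - hull[-2]
            if a[0] * b[1] - a[1] * b[0] <= 0:  # right turn or collinear: pop
                hull.pop()
            else:
                break
        hull.append(p)
    return np.array(hull)


def energy_above_hull(x: float, e: float, hull: np.ndarray) -> float:
    """Vertical distance from candidate (x, e) to the hull; 0 means stable."""
    return e - np.interp(x, hull[:, 0], hull[:, 1])


# Known phases: (composition x_B, formation energy in eV/atom), made-up values.
known = np.array([[0.0, 0.0], [0.5, -0.8], [1.0, 0.0]])
hull = lower_hull(known)
print(energy_above_hull(0.25, -0.3, hull))  # 0.1 > 0: candidate is metastable
```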
arXiv Detail & Related papers (2026-01-28T19:46:46Z)
- AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts [78.33143446024485]
We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z)
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation [56.886936435727854]
DeepResearchEval is an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking component that autonomously extracts and verifies report statements via web search, even when citations are missing.
arXiv Detail & Related papers (2026-01-14T18:38:31Z)
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation.
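A minimal sketch of checklist-based scoring in this spirit follows; the item wording, the judge interface, and the unweighted averaging are assumptions, not DeepSynth-Eval's exact protocol.

```python
# Hypothetical checklist scorer: fraction of items a judge marks satisfied.
from typing import Callable


def checklist_score(report: str, items: list[str],
                    judge: Callable[[str, str], bool]) -> float:
    """Score a report as the fraction of checklist items judged satisfied."""
    if not items:
        return 0.0
    return sum(judge(report, item) for item in items) / len(items)


# Illustrative items; real checklists would be derived from reference surveys.
general = ["Mentions dataset X and its size", "Covers method Y's limitation"]
constraints = ["Has a related-work section", "Cites at least 10 sources"]

# A real judge would be an LLM or a human; here a trivial keyword stub.
stub_judge = lambda report, item: item.split()[1].lower() in report.lower()
coverage = checklist_score("...survey text...", general, stub_judge)
structure = checklist_score("...survey text...", constraints, stub_judge)
```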
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
- LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology [3.470217255779291]
We introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful agent responses.
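The natural-language-to-provenance-query step might look like the following sketch; the schema, the prompt, and the `call_llm` callable are placeholders, not the paper's implementation.

```python
# Hypothetical NL-to-structured-query translation for workflow provenance.
import json

# Made-up provenance schema the agent is allowed to query.
SCHEMA = (
    "tasks(task_id, name, start_time, end_time, status)\n"
    "files(file_id, path, produced_by_task)"
)

PROMPT = """You answer questions about a running scientific workflow.
Schema:
{schema}

Translate the user's question into a JSON query of the form
{{"table": "...", "filters": {{"column": "value"}}, "aggregate": null}}.

Question: {question}
JSON:"""


def nl_to_query(question: str, call_llm) -> dict:
    """Translate a natural-language question into a structured provenance
    query; `call_llm` is any text-completion callable (an assumption)."""
    raw = call_llm(PROMPT.format(schema=SCHEMA, question=question))
    return json.loads(raw)  # a robust version would validate against SCHEMA
```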
arXiv Detail & Related papers (2025-09-17T13:51:29Z)
- Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates a business across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
- Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance. We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
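The PMI gauge itself is simple to state: PMI(doc; question) = log p(question | doc) - log p(question). Below is a hedged sketch assuming access to a `logprob` function returning a language model's total log-probability of a text given a context; that interface is an assumption, not the paper's exact setup.

```python
# Sketch of PMI as a document-selection gauge for RAG prompt construction.
def pmi(question: str, doc: str, logprob) -> float:
    """PMI(doc; question) = log p(question | doc) - log p(question).
    Higher PMI means the document makes the question more likely, i.e.
    it is plausibly more useful context for answering it."""
    return logprob(question, context=doc) - logprob(question, context="")


def select_context(question: str, docs: list[str], logprob, k: int = 3) -> list[str]:
    """Rank candidate documents by PMI and keep the top k for the prompt."""
    return sorted(docs, key=lambda d: pmi(question, d, logprob), reverse=True)[:k]
```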
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
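The varying-depth embedding can be sketched as follows; the sentence-boundary heuristic and helper names are assumptions for illustration, not NeedleBench's exact construction.

```python
# Illustrative needle-insertion helper: key facts ("needles") placed at
# controlled relative depths inside filler text.
def insert_needles(haystack: str, needles: list[str], depths: list[float]) -> str:
    """Insert each needle at a relative depth in [0, 1] of the context,
    snapping to sentence boundaries so the text stays readable."""
    sentences = haystack.split(". ")
    # Insert deepest-first so earlier insertions do not shift later positions.
    for needle, depth in sorted(zip(needles, depths), key=lambda x: -x[1]):
        pos = int(depth * len(sentences))
        sentences.insert(pos, needle.rstrip("."))
    return ". ".join(sentences)


ctx = insert_needles("Filler one. Filler two. Filler three. Filler four",
                     needles=["The key code is 7421"], depths=[0.5])
```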
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Deep Learning Schema-based Event Extraction: Literature Review and Current Trends [60.29289298349322]
Event extraction technology based on deep learning has become a research hotspot. This paper fills the gap left by the lack of a comprehensive review, surveying state-of-the-art approaches with a focus on deep learning-based models.
arXiv Detail & Related papers (2021-07-05T16:32:45Z)