SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA
- URL: http://arxiv.org/abs/2509.25459v1
- Date: Mon, 29 Sep 2025 20:07:00 GMT
- Title: SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA
- Authors: Haozhou Xu, Dongxia Wu, Matteo Chinazzi, Ruijia Niu, Rose Yu, Yi-An Ma,
- Abstract summary: Large language models (LLMs) show promise in solving scientific problems. They can help generate long-form answers for scientific questions. LLMs often suffer from hallucination, especially in the challenging task of long-form scientific question answering.
- Score: 35.02813727925432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) show promise in solving scientific problems. They can help generate long-form answers for scientific questions, which are crucial for comprehensive understanding of complex phenomena that require detailed explanations spanning multiple interconnected concepts and evidence. However, LLMs often suffer from hallucination, especially in the challenging task of long-form scientific question answering. Retrieval-Augmented Generation (RAG) approaches can ground LLMs by incorporating external knowledge sources to improve trustworthiness. In this context, scientific simulators, which play a vital role in validating hypotheses, offer a particularly promising retrieval source to mitigate hallucination and enhance answer factuality. However, existing RAG approaches cannot be directly applied for scientific simulation-based retrieval due to two fundamental challenges: how to retrieve from scientific simulators, and how to efficiently verify and update long-form answers. To overcome these challenges, we propose the simulator-based RAG framework (SimulRAG) and provide a long-form scientific QA benchmark covering climate science and epidemiology with ground truth verified by both simulations and human annotators. In this framework, we propose a generalized simulator retrieval interface to transform between textual and numerical modalities. We further design a claim-level generation method that utilizes uncertainty estimation scores and simulator boundary assessment (UE+SBA) to efficiently verify and update claims. Extensive experiments demonstrate SimulRAG outperforms traditional RAG baselines by 30.4% in informativeness and 16.3% in factuality. UE+SBA further improves efficiency and quality for claim-level generation.
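The abstract describes a claim-level generation method (UE+SBA) that uses uncertainty estimation scores and simulator boundary assessment to decide which claims to verify against the simulator. The paper's implementation is not included here; below is a minimal, hypothetical sketch of what such a gating loop might look like. All names, signatures, and the threshold value are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a claim-level verify-and-update loop in the spirit
# of SimulRAG's UE+SBA step. All names and thresholds are illustrative
# assumptions, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    uncertainty: float   # UE: e.g. derived from the LLM's token-level entropy
    in_sim_bounds: bool  # SBA: is the claim inside the simulator's valid domain?

def verify_and_update(claims, threshold=0.5):
    """Send only claims that are both uncertain and within the simulator's
    valid input domain to the (expensive) simulator; keep the rest as-is."""
    updated = []
    for claim in claims:
        if claim.uncertainty > threshold and claim.in_sim_bounds:
            # Placeholder for: translate the claim into simulator inputs,
            # run the simulation, and rewrite the claim from the outputs.
            updated.append(Claim(claim.text + " [verified by simulator]",
                                 uncertainty=0.0, in_sim_bounds=True))
        else:
            updated.append(claim)  # no simulator call needed
    return updated

claims = [
    Claim("R0 for this outbreak is about 2.5", uncertainty=0.8, in_sim_bounds=True),
    Claim("The model was first published in 2020", uncertainty=0.1, in_sim_bounds=False),
]
result = verify_and_update(claims)
```

The gating is the point: the uncertainty score filters out claims the LLM is already confident about, and the boundary assessment filters out claims the simulator cannot meaningfully check, which is how the framework keeps verification efficient.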
Related papers
- Grounding LLMs in Scientific Discovery via Embodied Actions [84.11877211907647]
Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and physical simulation. We propose EmbodiedAct, a framework that transforms established scientific software into active embodied agents by grounding LLMs in embodied actions with a tight perception-execution loop.
arXiv Detail & Related papers (2026-02-24T07:37:18Z) - DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering [28.427433335623217]
We propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning. This work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.
arXiv Detail & Related papers (2026-01-23T06:19:08Z) - WildSci: Advancing Scientific Reasoning from In-the-Wild Literature [50.16160754134139]
We introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach.
arXiv Detail & Related papers (2026-01-09T06:35:23Z) - SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors [58.87134689752605]
We introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. We show that even the best LLMs today have limited simulation ability (score: 40.80/100) and that performance scales log-linearly with model size. We demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning.
arXiv Detail & Related papers (2025-10-20T13:14:38Z) - PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors [29.988641224102164]
PhysGym is a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent.
arXiv Detail & Related papers (2025-07-21T12:28:10Z) - G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration [48.948187359727996]
G-Sim is a hybrid framework that automates simulator construction with rigorous empirical calibration. It produces reliable, causally-informed simulators, mitigating data-inefficiency and enabling robust system-level interventions.
arXiv Detail & Related papers (2025-06-10T22:14:34Z) - Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models [20.648157071328807]
Large language models (LLMs) can identify novel research directions by analyzing existing knowledge.
LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect.
We propose KG-CoI, a system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs.
arXiv Detail & Related papers (2024-11-04T18:50:00Z) - LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z) - ClimSim-Online: A Large Multi-scale Dataset and Framework for Hybrid ML-physics Climate Emulation [45.201929285600606]
We present ClimSim-Online, which includes an end-to-end workflow for developing hybrid ML-physics simulators.
The dataset is global and spans ten years at a high sampling frequency.
We provide a cross-platform, containerized pipeline to integrate ML models into operational climate simulators.
arXiv Detail & Related papers (2023-06-14T21:26:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or information and is not responsible for any consequences of their use.