Improving Data and Reward Design for Scientific Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2602.08321v2
- Date: Tue, 10 Feb 2026 12:26:18 GMT
- Title: Improving Data and Reward Design for Scientific Reasoning in Large Language Models
- Authors: Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong, Peng Cheng,
- Abstract summary: We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset. The Dr. SCI dataset comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
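The SciRubric-Guided RL component described above blends rubric-based evaluation with an explicit answer-correctness signal. A minimal sketch of what such a blended reward could look like, assuming per-criterion rubric grades in [0, 1] (e.g. from an LLM judge) and a binary correctness check; the function name, weighting scheme, and parameters are illustrative assumptions, not the paper's actual implementation:

```python
def rubric_reward(answer_correct: bool,
                  rubric_scores: list[float],
                  correctness_weight: float = 0.5) -> float:
    """Hypothetical blended reward for open-ended science questions.

    answer_correct:     explicit correctness of the final answer.
    rubric_scores:      per-criterion grades in [0, 1] from a rubric judge.
    correctness_weight: how much the binary signal counts vs. the rubric.
    """
    if not rubric_scores:
        # Verifiable split: no rubric, fall back to pure correctness.
        return float(answer_correct)
    rubric_avg = sum(rubric_scores) / len(rubric_scores)
    return (correctness_weight * float(answer_correct)
            + (1.0 - correctness_weight) * rubric_avg)
```

Anchoring the reward to explicit correctness is what makes RL on open-ended questions stable here: rubric grades alone can drift, but the correctness term caps how much reward a fluent-but-wrong answer can collect.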
Related papers
- CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning [44.519834940763964]
CHIMERA is a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. It has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy. It achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam.
arXiv Detail & Related papers (2026-03-01T03:23:41Z) - DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems [21.193573505360266]
We introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets. DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results.
arXiv Detail & Related papers (2026-01-20T04:44:36Z) - WildSci: Advancing Scientific Reasoning from In-the-Wild Literature [50.16160754134139]
We introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach.
arXiv Detail & Related papers (2026-01-09T06:35:23Z) - Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z) - A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z) - MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [32.21228080662089]
We present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level textbooks. We introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths.
arXiv Detail & Related papers (2025-07-22T17:59:03Z) - Foundation Models for Spatio-Temporal Data Science: A Tutorial and Survey [69.0648659029394]
Spatio-Temporal (ST) data science is fundamental to understanding complex systems in domains such as urban computing, climate science, and intelligent transportation. Researchers have begun exploring the concept of Spatio-Temporal Foundation Models (STFMs) to enhance adaptability and generalization across diverse ST tasks. STFMs empower the entire workflow of ST data science, ranging from data sensing and management to mining, thereby offering a more holistic and scalable approach.
arXiv Detail & Related papers (2025-03-12T09:42:18Z) - Towards Effective and Efficient Continual Pre-training of Large Language Models [163.34610964970258]
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. This paper presents a technical report on continually pre-training Llama-3 (8B). The approach significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model.
arXiv Detail & Related papers (2024-07-26T13:55:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.