WildSci: Advancing Scientific Reasoning from In-the-Wild Literature
- URL: http://arxiv.org/abs/2601.05567v1
- Date: Fri, 09 Jan 2026 06:35:23 GMT
- Title: WildSci: Advancing Scientific Reasoning from In-the-Wild Literature
- Authors: Tengxiao Liu, Deepak Nathani, Zekun Li, Kevin Yang, William Yang Wang
- Abstract summary: We introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in large language model (LLM) reasoning has focused on domains like mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in LLM reasoning remains limited in scientific domains such as medicine and materials science due to limited dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further apply reinforcement learning to fine-tune models on these data and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci to enable scalable and sustainable research in scientific reasoning, available at https://huggingface.co/datasets/JustinTX/WildSci.
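The abstract's key training mechanism is that multiple-choice framing yields a well-defined, automatically checkable reward for reinforcement learning. As a minimal sketch of that idea (the function name `mcq_reward` and the A-D letter-extraction heuristic are assumptions, not the paper's actual implementation), a binary reward can be computed by comparing the final answer letter in a model response against the gold choice:

```python
import re

def mcq_reward(response: str, gold_choice: str) -> float:
    """Binary reward for a multiple-choice answer.

    Hypothetical sketch: extracts the last standalone answer letter (A-D)
    from a model response and returns 1.0 on an exact match with the gold
    choice, 0.0 otherwise. WildSci's actual answer parsing may differ.
    """
    letters = re.findall(r"\b([A-D])\b", response.strip())
    if not letters:
        return 0.0  # no parseable choice: no reward
    return 1.0 if letters[-1] == gold_choice.upper() else 0.0
```

A rule-based reward like this is what makes the training "scalable": no learned reward model or human grading is needed, only string matching against the synthesized gold answer.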
Related papers
- Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.3527268311731]
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM). We operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
arXiv Detail & Related papers (2025-12-18T12:44:36Z)
- Evaluating Large Language Models in Scientific Discovery [91.732562776782]
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
arXiv Detail & Related papers (2025-12-17T16:20:03Z)
- A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
- SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications [63.92604046592333]
Video foundation models (FMs) hold considerable promise as general-purpose domain-agnostic approaches. We introduce SciVid, a benchmark comprising five tasks across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating potential for effective transfer learning.
arXiv Detail & Related papers (2025-07-04T13:48:12Z)
- BLADE: Benchmarking Language Model Agents for Data-Driven Science [21.682416167339635]
LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. We present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions.
arXiv Detail & Related papers (2024-08-19T02:59:35Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [97.31347312130119]
SciRIFF (Scientific Resource for Instruction-Following and Finetuning) is a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique as an entirely expert-written, high-quality instruction-following dataset for extracting and synthesizing information from research literature across diverse scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- An Interdisciplinary Outlook on Large Language Models for Scientific Research [3.4108358650013573]
We describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision.
We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications.
We articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use.
arXiv Detail & Related papers (2023-11-03T19:41:09Z)
- GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts [37.00741148951341]
We propose GeSS, a benchmark designed for evaluating the performance of GDL models in scientific scenarios with distribution shifts.
Our evaluation datasets cover diverse scientific domains, from particle physics and materials science to biochemistry, and encapsulate a broad spectrum of distribution shifts.
Overall, our benchmark results in 30 different experiment settings, and evaluates 3 GDL backbones and 11 learning algorithms in each setting.
arXiv Detail & Related papers (2023-10-12T19:27:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.