A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
- URL: http://arxiv.org/abs/2508.21148v1
- Date: Thu, 28 Aug 2025 18:30:52 GMT
- Title: A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
- Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou
- Abstract summary: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
- Score: 221.34650992288505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
Related papers
- Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration [63.61423859450929]
This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses. We identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery.
arXiv Detail & Related papers (2026-01-20T18:46:42Z)
- WildSci: Advancing Scientific Reasoning from In-the-Wild Literature [50.16160754134139]
We introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach.
arXiv Detail & Related papers (2026-01-09T06:35:23Z)
- Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team [53.38438460574943]
IDVSCI is a multi-agent framework built on large language models (LLMs). It incorporates two key innovations: a Dynamic Knowledge Exchange mechanism and a Dual-Diversity Review paradigm. Results show that IDVSCI consistently achieves the best performance across two datasets.
arXiv Detail & Related papers (2025-06-23T07:12:08Z)
- SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models [35.839640555805374]
SciCUEval is a benchmark dataset tailored to assess the scientific context understanding capability of Large Language Models (LLMs). It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. It systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats.
arXiv Detail & Related papers (2025-05-21T04:33:26Z)
- From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery [67.07598263346591]
Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science.
arXiv Detail & Related papers (2025-05-19T15:41:32Z)
- Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents [11.74019905854637]
Large language models (LLMs) are evolving into scientific agents that automate critical tasks. Unlike general-purpose LLMs, specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields.
arXiv Detail & Related papers (2025-03-31T13:11:28Z)
- BLADE: Benchmarking Language Model Agents for Data-Driven Science [18.577658530714505]
LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science.
We present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions.
arXiv Detail & Related papers (2024-08-19T02:59:35Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- End-to-end Phase Field Model Discovery Combining Experimentation, Crowdsourcing, Simulation and Learning [9.763339269757227]
We present the Phase-Field-Lab platform for end-to-end phase field model discovery.
Phase-Field-Lab combines (i) a streamlined annotation tool which reduces the annotation time; (ii) an end-to-end neural model which automatically learns phase field models from data; and (iii) novel interfaces and visualizations.
Our platform is deployed in the analysis of nano-structure evolution in materials under extreme conditions.
arXiv Detail & Related papers (2023-09-13T22:44:04Z)
- When Geoscience Meets Foundation Models: Towards General Geoscience Artificial Intelligence System [6.445323648941926]
Geoscience foundation models (GFMs) are a paradigm-shifting solution, integrating extensive cross-disciplinary data to enhance the simulation and understanding of Earth system dynamics.
The unique strengths of GFMs include flexible task specification, diverse input-output capabilities, and multi-modal knowledge representation.
This review offers a comprehensive overview of emerging geoscientific research paradigms, emphasizing the untapped opportunities at the intersection of advanced AI techniques and geoscience.
arXiv Detail & Related papers (2023-09-13T08:44:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.