SciGLM: Training Scientific Language Models with Self-Reflective
Instruction Annotation and Tuning
- URL: http://arxiv.org/abs/2401.07950v2
- Date: Tue, 12 Mar 2024 18:34:05 GMT
- Title: SciGLM: Training Scientific Language Models with Self-Reflective
Instruction Annotation and Tuning
- Authors: Dan Zhang and Ziniu Hu and Sining Zhoubian and Zhengxiao Du and Kaiyu
Yang and Zihan Wang and Yisong Yue and Yuxiao Dong and Jie Tang
- Abstract summary: SciGLM is a suite of scientific language models able to conduct college-level scientific reasoning.
We apply a self-reflective instruction annotation framework to generate step-by-step reasoning for unlabelled scientific questions.
We fine-tuned the ChatGLM family of language models with SciInstruct, enhancing their scientific and mathematical reasoning capabilities.
- Score: 60.14510984576027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown promise in assisting scientific
discovery. However, such applications are currently limited by LLMs'
deficiencies in understanding intricate scientific concepts, deriving symbolic
equations, and solving advanced numerical calculations. To bridge these gaps,
we introduce SciGLM, a suite of scientific language models able to conduct
college-level scientific reasoning. Central to our approach is a novel
self-reflective instruction annotation framework to address the data scarcity
challenge in the science domain. This framework leverages existing LLMs to
generate step-by-step reasoning for unlabelled scientific questions, followed
by a process of self-reflective critic-and-revise. Applying this framework, we
curated SciInstruct, a diverse and high-quality dataset encompassing physics,
chemistry, math, and formal proofs. We fine-tuned the ChatGLM family of
language models with SciInstruct, enhancing their scientific and mathematical
reasoning capabilities. Remarkably, SciGLM consistently improves both the base
model (ChatGLM3-6B-Base), by 4.87%, and larger-scale models (32B), by 2.67%,
without sacrificing the language understanding capabilities of the base model.
This makes SciGLM a suitable foundational model to facilitate diverse
scientific discovery tasks. For the benefit of the wider research community, we
release SciInstruct and SciGLM, alongside the self-reflective framework and
fine-tuning code, at https://github.com/THUDM/SciGLM.
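As a rough illustration, the generate-then-critic-and-revise annotation loop described in the abstract can be sketched as below. This is a minimal sketch, not the paper's actual implementation: `call_llm` is a stand-in stub for a real LLM API, and all prompt strings, function names, and the acceptance convention are hypothetical.

```python
# Minimal sketch of a self-reflective critic-and-revise annotation loop.
# `call_llm` is a placeholder stub so the sketch runs end to end; in a
# real pipeline it would query an existing LLM.

def call_llm(prompt: str) -> str:
    """Stub LLM: returns canned responses so the sketch is runnable."""
    if prompt.startswith("CRITIQUE"):
        return "OK"  # critic found no errors in the reasoning
    return "Step 1: 6 * 7 multiplies six by seven. Answer: 42"

def annotate_question(question: str, max_revisions: int = 2) -> str:
    """Generate step-by-step reasoning, then critique and revise it."""
    reasoning = call_llm(f"Solve step by step: {question}")
    for _ in range(max_revisions):
        critique = call_llm(f"CRITIQUE the following solution: {reasoning}")
        if critique.strip() == "OK":
            break  # critic accepts the reasoning; stop revising
        reasoning = call_llm(
            f"REVISE the solution to {question} given this critique: {critique}"
        )
    return reasoning

if __name__ == "__main__":
    print(annotate_question("What is 6 * 7?"))
```

Accepted reasoning traces produced by such a loop would then form the instruction-tuning pairs (question, step-by-step solution) that make up a dataset like SciInstruct.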
Related papers
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
We comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
- LLM-SR: Scientific Equation Discovery via Programming with Large Language Models [17.64574496035502]
Traditional methods of equation discovery, known as symbolic regression, largely focus on extracting equations from data alone.
We introduce LLM-SR, a novel approach that leverages the scientific knowledge and robust code generation capabilities of Large Language Models.
We demonstrate LLM-SR's effectiveness across three diverse scientific domains, where it discovers physically accurate equations.
arXiv Detail & Related papers (2024-04-29T03:30:06Z)
- A Survey on Self-Evolution of Large Language Models [116.54238664264928]
Large language models (LLMs) have advanced significantly across various fields and intelligent-agent applications.
In response, self-evolution approaches that enable LLMs to autonomously acquire, refine, and learn from experiences generated by the model itself are growing rapidly.
arXiv Detail & Related papers (2024-04-22T17:43:23Z)
- Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z)
- An Interdisciplinary Outlook on Large Language Models for Scientific Research [3.4108358650013573]
We describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision.
We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications.
We articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use.
arXiv Detail & Related papers (2023-11-03T19:41:09Z)
- DARWIN Series: Domain Specific Large Language Models for Natural Science [20.864698325126735]
We present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and material science.
We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness.
The DARWIN series not only achieves state-of-the-art results on various scientific tasks but also reduces reliance on closed-source AI models.
arXiv Detail & Related papers (2023-08-25T01:40:48Z)
- SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions [0.7264378254137809]
In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow scientific multimodal instructions.
To test our methodology, we use a human-generated scientific instruction tuning dataset and train a large multimodal model LLaMA-SciTune.
In comparison to the models that are finetuned with machine generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark.
arXiv Detail & Related papers (2023-07-03T16:25:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.