SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models
- URL: http://arxiv.org/abs/2401.07950v3
- Date: Mon, 18 Nov 2024 05:30:50 GMT
- Title: SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models
- Authors: Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, Jie Tang
- Abstract summary: We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
- Abstract: Large Language Models (LLMs) have shown promise in assisting scientific discovery. However, such applications are currently limited by LLMs' deficiencies in understanding intricate scientific concepts, deriving symbolic equations, and solving advanced numerical calculations. To bridge these gaps, we introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs. We analyze the curated SciInstruct from multiple interesting perspectives (e.g., domain, scale, source, question type, answer length, etc.). To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath, enhancing their scientific and mathematical reasoning capabilities, without sacrificing the language understanding capabilities of the base model. We release all codes and SciInstruct at https://github.com/THUDM/SciGLM.
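The critic-and-revise loop described in the abstract can be sketched as a short annotation routine. This is a minimal illustration only, assuming a generic `llm(prompt)` callable; the function name, prompt wording, and stopping rule are assumptions, not the authors' actual pipeline.

```python
# Hedged sketch of a self-reflective critic-and-revise annotation loop:
# generate a step-by-step solution, ask the model to critique it, and
# revise until the critique finds no errors (or a round limit is hit).
def annotate(question, llm, max_rounds=2):
    """Produce an instruction-tuning record for an unlabelled question."""
    solution = llm(f"Solve step by step:\n{question}")
    for _ in range(max_rounds):
        critique = llm(f"Find errors in this solution:\n{solution}")
        if "no errors" in critique.lower():
            break  # accept the solution as a training instance
        solution = llm(
            f"Revise the solution given this critique:\n{critique}\n{solution}"
        )
    return {"instruction": question, "output": solution}
```

In this sketch the final record pairs the original question with the last revised solution, mirroring how self-annotated data could be collected at scale without human labels.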
Related papers
- Artificial Scientific Discovery [5.241773225218436]
This thesis spans from AlphaGo to ChatGPT to examine the concepts needed to realize the vision of an artificial scientist.
An artificial scientist must develop its own interpretation of the language used to explain its findings.
This perspective leads us to see modern multimodal models as interpreters, and to devise a new way to build interpretable and cost-effective CLIP-like models.
arXiv Detail & Related papers (2024-11-18T15:51:45Z)
- Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models [20.648157071328807]
Large language models (LLMs) can identify novel research directions by analyzing existing knowledge.
LLMs are prone to generating "hallucinations": outputs that are plausible-sounding but factually incorrect.
We propose KG-CoI, a system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs.
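The grounding idea behind KG-CoI can be illustrated with a toy pipeline: retrieve triples from a knowledge graph, inject them into the prompt, and check whether the generated hypothesis mentions any grounded entity. All names and the retrieval/checking logic here are assumptions for illustration, not the paper's actual system.

```python
# Toy sketch of KG-grounded hypothesis generation: retrieved triples are
# injected into the prompt, and the output is flagged if none of the
# graph entities appear in it (a crude hallucination check).
def grounded_hypothesis(topic, kg, llm):
    """Prompt an LLM with knowledge-graph triples and flag ungrounded output."""
    triples = [t for t in kg if topic in t]  # naive retrieval by entity match
    context = "; ".join(" ".join(t) for t in triples)
    hypothesis = llm(f"Known facts: {context}\nPropose a hypothesis about {topic}.")
    entities = {e for t in triples for e in t}
    supported = any(e in hypothesis for e in entities)
    return hypothesis, supported
```

A real system would use structured KG queries and claim-level verification rather than substring matching, but the retrieve-inject-check shape is the same.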
arXiv Detail & Related papers (2024-11-04T18:50:00Z)
- Learning Beyond Pattern Matching? Assaying Mathematical Understanding in LLMs [58.09253149867228]
This paper assesses the domain knowledge of LLMs through their understanding of the different mathematical skills required to solve problems.
Motivated by the use of LLMs as general scientific assistants, we propose NTKEval to assess changes in an LLM's probability distribution.
Our systematic analysis finds evidence of domain understanding during in-context learning.
Certain instruction-tuning leads to similar performance changes irrespective of training on different data, suggesting a lack of domain understanding across different skills.
arXiv Detail & Related papers (2024-05-24T12:04:54Z)
- LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
- SciAgent: Tool-augmented Language Models for Scientific Reasoning [129.51442677710452]
We introduce a new task setting named tool-augmented scientific reasoning.
This setting supplements Large Language Models with scalable toolsets.
We construct a tool-augmented training corpus named MathFunc which encompasses over 30,000 samples and roughly 6,000 tools.
Building on MathFunc, we develop SciAgent to retrieve, understand and, if necessary, use tools for scientific problem solving.
arXiv Detail & Related papers (2024-02-18T04:19:44Z)
- Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z)
- Large Language Models for Scientific Synthesis, Inference and Explanation [56.41963802804953]
We show how large language models can perform scientific synthesis, inference, and explanation.
We show that the large language model can augment this "knowledge" by synthesizing from the scientific literature.
This approach has the further advantage that the large language model can explain the machine learning system's predictions.
arXiv Detail & Related papers (2023-10-12T02:17:59Z)
- DARWIN Series: Domain Specific Large Language Models for Natural Science [20.864698325126735]
We present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and material science.
We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness.
The DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models.
arXiv Detail & Related papers (2023-08-25T01:40:48Z)
- SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions [0.7264378254137809]
In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow scientific multimodal instructions.
To test our methodology, we use a human-generated scientific instruction tuning dataset and train a large multimodal model LLaMA-SciTune.
In comparison to models fine-tuned with machine-generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories of the ScienceQA benchmark.
arXiv Detail & Related papers (2023-07-03T16:25:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.