Related papers: DARWIN Series: Domain Specific Large Language Models for Natural Science

DARWIN Series: Domain Specific Large Language Models for Natural Science

URL: http://arxiv.org/abs/2308.13565v1
Date: Fri, 25 Aug 2023 01:40:48 GMT
Title: DARWIN Series: Domain Specific Large Language Models for Natural Science
Authors: Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang, Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, Bram Hoex
Abstract summary: We present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and material science. We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness. DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models.
Score: 20.864698325126735
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Emerging tools bring forth fresh approaches to work, and the field of natural science is no different. In natural science, traditional manual, serial, and labour-intensive work is being augmented by automated, parallel, and iterative processes driven by artificial intelligence-based experimental automation and more. To add new capabilities in natural science, enabling the acceleration and enrichment of automation of the discovery process, we present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and material science. This series relies on open-source LLM, incorporating structured and unstructured scientific knowledge from public datasets and literature. We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness. During the fine-tuning, we introduce the Scientific Instruction Generation (SIG) model, automating instruction generation from scientific texts. This eliminates the need for manual extraction or domain-specific knowledge graphs and efficiently injects scientific knowledge into the model. We also explore multi-task training strategies, revealing interconnections between scientific tasks. DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models. Our research showcases the ability of LLM in the scientific domain, with the overarching goal of fostering prosperity within the broader AI for science community.

Related papers

Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery [35.888956949646]
Large Language Models (LLMs) are transforming scientific research by reshaping the scientific method.<n>LLMs are involved in experimental design, data analysis, and enhancing productivity, particularly in chemistry and biology.<n>Transition to AI-driven science raises ethical questions about creativity, oversight, and responsibility.
arXiv Detail & Related papers (2025-05-22T10:05:48Z)
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery [43.31110556077432]
Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery.<n>This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science.
arXiv Detail & Related papers (2025-05-19T15:41:32Z)
Scaling Laws in Scientific Discovery with AI and Robot Scientists [72.3420699173245]
An autonomous generalist scientist (AGS) concept combines agentic AI and embodied robotics to automate the entire research lifecycle. AGS aims to significantly reduce the time and resources needed for scientific discovery. As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery might adhere to new scaling laws.
arXiv Detail & Related papers (2025-03-28T14:00:27Z)
Nature Language Model: Deciphering the Language of Nature for Scientific Discovery [105.55751854768297]
Foundation models have revolutionized natural language processing and artificial intelligence. We introduce Nature Language Model (NatureLM), a sequence-based science foundation model for scientific discovery.
arXiv Detail & Related papers (2025-02-11T13:08:03Z)
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System [62.832818186789545]
Virtual Scientists (VirSci) is a multi-agent system designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. We show that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z)
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [14.465756130099091]
This paper presents the first comprehensive framework for fully automatic scientific discovery. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, and describes its findings. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community.
arXiv Detail & Related papers (2024-08-12T16:58:11Z)
Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding [0.0]
This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains. We employ pre-trained models and fine-tune them on datasets in the scientific domain.
arXiv Detail & Related papers (2024-08-04T01:32:09Z)
A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled. We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations. We introduce Scientific Generative Agent (SGA), a bilevel optimization framework. We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z)
SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning. We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs. To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z)
The Future of Fundamental Science Led by Generative Closed-Loop Artificial Intelligence [67.70415658080121]
Recent advances in machine learning and AI are disrupting technological innovation, product development, and society as a whole. AI has contributed less to fundamental science in part because large data sets of high-quality data for scientific practice and model discovery are more difficult to access. Here we explore and investigate aspects of an AI-driven, automated, closed-loop approach to scientific discovery.
arXiv Detail & Related papers (2023-07-09T21:16:56Z)
Learning from learning machines: a new generation of AI technology to meet the needs of science [59.261050918992325]
We outline emerging opportunities and challenges to enhance the utility of AI for scientific discovery. The distinct goals of AI for industry versus the goals of AI for science create tension between identifying patterns in data versus discovering patterns in the world from data.
arXiv Detail & Related papers (2021-11-27T00:55:21Z)
Scientific intuition inspired by machine learning generated hypotheses [2.294014185517203]
We shift the focus on the insights and the knowledge obtained by the machine learning models themselves. We apply gradient boosting in decision trees to extract human interpretable insights from big data sets from chemistry and physics. The ability to go beyond numerics opens the door to use machine learning to accelerate the discovery of conceptual understanding.
arXiv Detail & Related papers (2020-10-27T12:12:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.