DARWIN Series: Domain Specific Large Language Models for Natural Science
- URL: http://arxiv.org/abs/2308.13565v1
- Date: Fri, 25 Aug 2023 01:40:48 GMT
- Title: DARWIN Series: Domain Specific Large Language Models for Natural Science
- Authors: Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang,
Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, Bram
Hoex
- Abstract summary: We present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and material science.
We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness.
DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models.
- Score: 20.864698325126735
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Emerging tools bring forth fresh approaches to work, and the field of natural
science is no different. In natural science, traditional manual, serial, and
labour-intensive work is being augmented by automated, parallel, and iterative
processes driven by artificial intelligence-based experimental automation and
more. To add new capabilities in natural science, enabling the acceleration and
enrichment of automation of the discovery process, we present DARWIN, a series
of tailored LLMs for natural science, mainly in physics, chemistry, and
material science. This series relies on open-source LLM, incorporating
structured and unstructured scientific knowledge from public datasets and
literature. We fine-tuned the models using over 60,000 instruction data points,
emphasizing factual correctness. During the fine-tuning, we introduce the
Scientific Instruction Generation (SIG) model, automating instruction
generation from scientific texts. This eliminates the need for manual
extraction or domain-specific knowledge graphs and efficiently injects
scientific knowledge into the model. We also explore multi-task training
strategies, revealing interconnections between scientific tasks. DARWIN series
not only achieves state-of-the-art results on various scientific tasks but also
diminishes reliance on closed-source AI models. Our research showcases the
ability of LLM in the scientific domain, with the overarching goal of fostering
prosperity within the broader AI for science community.
Related papers
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
We comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality.
arXiv Detail & Related papers (2024-06-16T08:03:24Z) - LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z) - Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z) - SciGLM: Training Scientific Language Models with Self-Reflective
Instruction Annotation and Tuning [60.14510984576027]
SciGLM is a suite of scientific language models able to conduct college-level scientific reasoning.
We apply a self-reflective instruction annotation framework to generate step-by-step reasoning for unlabelled scientific questions.
We fine-tuned the ChatGLM family of language models with SciInstruct, enhancing their scientific and mathematical reasoning capabilities.
arXiv Detail & Related papers (2024-01-15T20:22:21Z) - The Future of Fundamental Science Led by Generative Closed-Loop
Artificial Intelligence [67.70415658080121]
Recent advances in machine learning and AI are disrupting technological innovation, product development, and society as a whole.
AI has contributed less to fundamental science in part because large data sets of high-quality data for scientific practice and model discovery are more difficult to access.
Here we explore and investigate aspects of an AI-driven, automated, closed-loop approach to scientific discovery.
arXiv Detail & Related papers (2023-07-09T21:16:56Z) - Automated Scientific Discovery: From Equation Discovery to Autonomous
Discovery Systems [5.7923858184309385]
The paper surveys automated scientific discovery, from equation discovery and symbolic regression to autonomous discovery systems and agents.
We will present closed-loop scientific discovery systems, starting with the pioneering work on the Adam system up to current efforts in fields from material science to astronomy.
The maximal level, level five, is defined to require no human intervention at all in the production of scientific knowledge.
arXiv Detail & Related papers (2023-05-03T16:35:41Z) - Learning from learning machines: a new generation of AI technology to
meet the needs of science [59.261050918992325]
We outline emerging opportunities and challenges to enhance the utility of AI for scientific discovery.
The distinct goals of AI for industry versus the goals of AI for science create tension between identifying patterns in data versus discovering patterns in the world from data.
arXiv Detail & Related papers (2021-11-27T00:55:21Z) - Scientific intuition inspired by machine learning generated hypotheses [2.294014185517203]
We shift the focus on the insights and the knowledge obtained by the machine learning models themselves.
We apply gradient boosting in decision trees to extract human interpretable insights from big data sets from chemistry and physics.
The ability to go beyond numerics opens the door to use machine learning to accelerate the discovery of conceptual understanding.
arXiv Detail & Related papers (2020-10-27T12:12:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.