DARWIN Series: Domain Specific Large Language Models for Natural Science
- URL: http://arxiv.org/abs/2308.13565v1
- Date: Fri, 25 Aug 2023 01:40:48 GMT
- Title: DARWIN Series: Domain Specific Large Language Models for Natural Science
- Authors: Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang,
Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, Bram
Hoex
- Abstract summary: We present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and material science.
We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness.
DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models.
- Score: 20.864698325126735
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Emerging tools bring forth fresh approaches to work, and the field of natural
science is no different. In natural science, traditional manual, serial, and
labour-intensive work is being augmented by automated, parallel, and iterative
processes driven by artificial intelligence-based experimental automation and
more. To add new capabilities in natural science, enabling the acceleration and
enrichment of automation of the discovery process, we present DARWIN, a series
of tailored LLMs for natural science, mainly in physics, chemistry, and
material science. This series relies on open-source LLM, incorporating
structured and unstructured scientific knowledge from public datasets and
literature. We fine-tuned the models using over 60,000 instruction data points,
emphasizing factual correctness. During the fine-tuning, we introduce the
Scientific Instruction Generation (SIG) model, automating instruction
generation from scientific texts. This eliminates the need for manual
extraction or domain-specific knowledge graphs and efficiently injects
scientific knowledge into the model. We also explore multi-task training
strategies, revealing interconnections between scientific tasks. DARWIN series
not only achieves state-of-the-art results on various scientific tasks but also
diminishes reliance on closed-source AI models. Our research showcases the
ability of LLM in the scientific domain, with the overarching goal of fostering
prosperity within the broader AI for science community.
Related papers
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [14.465756130099091]
This paper presents the first comprehensive framework for fully automatic scientific discovery.
We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, and describes its findings.
In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community.
arXiv Detail & Related papers (2024-08-12T16:58:11Z) - Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding [0.0]
This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains.
We employ pre-trained models and fine-tune them on datasets in the scientific domain.
arXiv Detail & Related papers (2024-08-04T01:32:09Z) - A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z) - LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z) - Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z) - SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z) - The Future of Fundamental Science Led by Generative Closed-Loop
Artificial Intelligence [67.70415658080121]
Recent advances in machine learning and AI are disrupting technological innovation, product development, and society as a whole.
AI has contributed less to fundamental science in part because large data sets of high-quality data for scientific practice and model discovery are more difficult to access.
Here we explore and investigate aspects of an AI-driven, automated, closed-loop approach to scientific discovery.
arXiv Detail & Related papers (2023-07-09T21:16:56Z) - Learning from learning machines: a new generation of AI technology to
meet the needs of science [59.261050918992325]
We outline emerging opportunities and challenges to enhance the utility of AI for scientific discovery.
The distinct goals of AI for industry versus the goals of AI for science create tension between identifying patterns in data versus discovering patterns in the world from data.
arXiv Detail & Related papers (2021-11-27T00:55:21Z) - Scientific intuition inspired by machine learning generated hypotheses [2.294014185517203]
We shift the focus on the insights and the knowledge obtained by the machine learning models themselves.
We apply gradient boosting in decision trees to extract human interpretable insights from big data sets from chemistry and physics.
The ability to go beyond numerics opens the door to use machine learning to accelerate the discovery of conceptual understanding.
arXiv Detail & Related papers (2020-10-27T12:12:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.