HoneyBee: Progressive Instruction Finetuning of Large Language Models
for Materials Science
- URL: http://arxiv.org/abs/2310.08511v1
- Date: Thu, 12 Oct 2023 17:06:19 GMT
- Title: HoneyBee: Progressive Instruction Finetuning of Large Language Models
for Materials Science
- Authors: Yu Song, Santiago Miret, Huan Zhang, Bang Liu
- Abstract summary: We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct)
We then apply MatSci-Instruct to finetune a LLaMa-based language model targeted for materials science (HoneyBee)
- Score: 36.44466740289109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an instruction-based process for trustworthy data curation in
materials science (MatSci-Instruct), which we then apply to finetune a
LLaMa-based language model targeted for materials science (HoneyBee).
MatSci-Instruct helps alleviate the scarcity of relevant, high-quality
materials science textual data available in the open literature, and HoneyBee
is the first billion-parameter language model specialized to materials science.
In MatSci-Instruct we improve the trustworthiness of generated data by
prompting multiple commercially available large language models: an Instructor
module (e.g. ChatGPT) generates the instruction data and an independent
Verifier module (e.g. Claude) verifies it. Using MatSci-Instruct, we construct a dataset of
multiple tasks and measure the quality of our dataset along multiple
dimensions, including accuracy against known facts, relevance to materials
science, as well as completeness and reasonableness of the data. Moreover, we
iteratively generate more targeted instructions and instruction-data in a
finetuning-evaluation-feedback loop leading to progressively better performance
for our finetuned HoneyBee models. Our evaluation on the MatSci-NLP benchmark
shows that HoneyBee outperforms existing language models on materials science
tasks and improves iteratively across successive stages of
instruction-data refinement. We study the quality of HoneyBee's language
modeling through automatic evaluation and analyze case studies to further
understand the model's capabilities and limitations. Our code and relevant
datasets are publicly available at
\url{https://github.com/BangLab-UdeM-Mila/NLP4MatSci-HoneyBee}.
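The Instructor-Verifier design and the finetuning-evaluation-feedback loop can be summarized in a short sketch. The following Python code is a minimal illustration under stated assumptions: the `call_instructor` and `call_verifier` callables, the prompts, and the 0.8 quality threshold are hypothetical stand-ins, not the released MatSci-Instruct implementation (see the repository above for the actual code).

```python
# Minimal sketch of the MatSci-Instruct curation loop described in the abstract.
# All names, prompts, and the 0.8 threshold are illustrative assumptions, not the
# authors' released implementation.

def instructor_generate(topic, n_samples, call_instructor):
    """Ask the Instructor LLM (e.g. ChatGPT) for instruction-response pairs."""
    prompt = (
        f"Write {n_samples} instruction-response pairs about {topic} in "
        "materials science, one JSON object per line."
    )
    return call_instructor(prompt)  # assumed: returns a list of dicts

def verifier_score(sample, call_verifier):
    """Ask an independent Verifier LLM (e.g. Claude) to rate a generated sample."""
    prompt = (
        "Rate this instruction-response pair from 0 to 1 for accuracy, relevance "
        f"to materials science, completeness, and reasonableness:\n{sample}"
    )
    return float(call_verifier(prompt))  # assumed: returns a numeric score

def matsci_instruct_round(topics, call_instructor, call_verifier, threshold=0.8):
    """One generation/verification round: keep only samples the Verifier trusts."""
    curated = []
    for topic in topics:
        for sample in instructor_generate(topic, 10, call_instructor):
            if verifier_score(sample, call_verifier) >= threshold:
                curated.append(sample)
    return curated  # instruction data used to finetune the next HoneyBee stage
```

In the paper's progressive setup, each finetuning stage is followed by evaluation on held-out tasks, and that feedback is used to request more targeted instructions in the next round.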
Related papers
- Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates [57.29125360837203]
Cookbook is a framework that generates training data consisting of simple patterns over random tokens.
We find that finetuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points.
arXiv Detail & Related papers (2024-10-07T17:29:40Z)
- HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models [16.567468717846676]
HoneyBee is a scalable modular framework for building multimodal oncology datasets.
It generates embeddings that capture the essential features and relationships within the raw medical data.
HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
arXiv Detail & Related papers (2024-05-13T04:35:14Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Materials Informatics Transformer: A Language Model for Interpretable Materials Properties Prediction [6.349503549199403]
We introduce our model Materials Informatics Transformer (MatInFormer) for material property prediction.
Specifically, we introduce a novel approach that involves learning the grammar of crystallography through the tokenization of pertinent space group information.
arXiv Detail & Related papers (2023-08-30T18:34:55Z)
- Self-Alignment with Instruction Backtranslation [162.02529653768096]
We present a method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions.
Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus.
arXiv Detail & Related papers (2023-08-11T17:47:54Z)
- MatSci-NLP: Evaluating Scientific Language Models on Materials Science Language Tasks Using Text-to-Schema Modeling [13.30198968869312]
MatSci-NLP is a benchmark for evaluating the performance of natural language processing (NLP) models on materials science text.
We construct the benchmark from publicly available materials science text data to encompass seven different NLP tasks.
We study various BERT-based models pretrained on different scientific text corpora on MatSci-NLP to understand the impact of pretraining strategies on understanding materials science text.
arXiv Detail & Related papers (2023-05-14T22:01:24Z)
- MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction [13.924666106089425]
MatSciBERT is a language model trained on a large corpus of scientific literature published in the materials domain.
We show that MatSciBERT outperforms SciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction.
We also discuss some of the applications of MatSciBERT in the materials domain for extracting information.
arXiv Detail & Related papers (2021-09-30T17:35:02Z)
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.