HoneyBee: Progressive Instruction Finetuning of Large Language Models
for Materials Science
- URL: http://arxiv.org/abs/2310.08511v1
- Date: Thu, 12 Oct 2023 17:06:19 GMT
- Title: HoneyBee: Progressive Instruction Finetuning of Large Language Models
for Materials Science
- Authors: Yu Song, Santiago Miret, Huan Zhang, Bang Liu
- Abstract summary: We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct).
We then apply it to finetune a LLaMa-based language model targeted at materials science (HoneyBee).
- Score: 36.44466740289109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an instruction-based process for trustworthy data curation in
materials science (MatSci-Instruct), which we then apply to finetune a
LLaMa-based language model targeted for materials science (HoneyBee).
MatSci-Instruct helps alleviate the scarcity of relevant, high-quality
materials science textual data available in the open literature, and HoneyBee
is the first billion-parameter language model specialized to materials science.
In MatSci-Instruct, we improve the trustworthiness of generated data by
prompting multiple commercially available large language models: an Instructor
module (e.g., ChatGPT) generates instruction data, and an independent Verifier
module (e.g., Claude) verifies it. Using MatSci-Instruct, we construct a dataset of
multiple tasks and measure the quality of our dataset along multiple
dimensions, including accuracy against known facts, relevance to materials
science, as well as completeness and reasonableness of the data. Moreover, we
iteratively generate more targeted instructions and instruction-data in a
finetuning-evaluation-feedback loop leading to progressively better performance
for our finetuned HoneyBee models. Our evaluation on the MatSci-NLP benchmark
shows that HoneyBee outperforms existing language models on materials science
tasks and improves iteratively across successive stages of instruction-data
refinement. We study the quality of HoneyBee's language
modeling through automatic evaluation and analyze case studies to further
understand the model's capabilities and limitations. Our code and relevant
datasets are publicly available at
https://github.com/BangLab-UdeM-Mila/NLP4MatSci-HoneyBee.
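
The abstract describes a generate-verify-finetune loop. Below is a minimal Python sketch of that flow, assuming hypothetical helper names (call_instructor, call_verifier, finetune_and_evaluate) and a simple minimum-score filter; it illustrates the described process and is not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class InstructionSample:
    instruction: str
    response: str
    # Quality dimensions named in the paper: accuracy, relevance, completeness, reasonableness.
    scores: dict = field(default_factory=dict)

def call_instructor(topic: str, n: int) -> list:
    """Prompt the Instructor LLM (e.g. ChatGPT) for n draft instruction-response pairs."""
    raise NotImplementedError("wrap your Instructor-model API call here")

def call_verifier(sample: InstructionSample) -> dict:
    """Ask an independent Verifier LLM (e.g. Claude) to score a sample per dimension (0-1)."""
    raise NotImplementedError("wrap your Verifier-model API call here")

def finetune_and_evaluate(dataset: list) -> float:
    """Finetune the LLaMa-based model on the dataset and return a benchmark score."""
    raise NotImplementedError("plug in your finetuning and evaluation pipeline here")

def matsci_instruct(topics: list, rounds: int = 3, per_topic: int = 20,
                    min_score: float = 0.7) -> list:
    """Illustrative generate-verify-finetune loop (helper names and threshold are assumptions)."""
    dataset = []
    for round_idx in range(rounds):
        # 1. Generation: the Instructor module drafts candidate instruction data.
        candidates = [s for topic in topics for s in call_instructor(topic, per_topic)]
        # 2. Verification: the Verifier module scores each candidate; keep trustworthy ones.
        for sample in candidates:
            sample.scores = call_verifier(sample)
            if min(sample.scores.values()) >= min_score:
                dataset.append(sample)
        # 3. Feedback: finetune and evaluate; in the paper this signal guides the
        #    next round of more targeted instruction generation (not modeled here).
        score = finetune_and_evaluate(dataset)
        print(f"round {round_idx + 1}: kept {len(dataset)} samples, benchmark score {score:.3f}")
    return dataset
```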
Related papers
- Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates [57.29125360837203]
Cookbook is a framework that generates training data consisting of simple patterns over random tokens.
We find that finetuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points.
arXiv Detail & Related papers (2024-10-07T17:29:40Z)
- HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models [17.774341783844026]
HoneyBee is a scalable modular framework for building multimodal oncology datasets.
It generates embeddings that capture the essential features and relationships within the raw medical data.
HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
arXiv Detail & Related papers (2024-05-13T04:35:14Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Materials Informatics Transformer: A Language Model for Interpretable Materials Properties Prediction [6.349503549199403]
We introduce our model Materials Informatics Transformer (MatInFormer) for material property prediction.
Specifically, we introduce a novel approach that involves learning the grammar of crystallography through the tokenization of pertinent space group information.
arXiv Detail & Related papers (2023-08-30T18:34:55Z)
- Self-Alignment with Instruction Backtranslation [162.02529653768096]
We present a method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions.
Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus.
arXiv Detail & Related papers (2023-08-11T17:47:54Z)
- MatSci-NLP: Evaluating Scientific Language Models on Materials Science Language Tasks Using Text-to-Schema Modeling [13.30198968869312]
MatSci-NLP is a benchmark for evaluating the performance of natural language processing (NLP) models on materials science text.
We construct the benchmark from publicly available materials science text data to encompass seven different NLP tasks.
We evaluate various BERT-based models pretrained on different scientific text corpora on MatSci-NLP to understand how pretraining strategies affect understanding of materials science text.
arXiv Detail & Related papers (2023-05-14T22:01:24Z)
- MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction [13.924666106089425]
MatSciBERT is a language model trained on a large corpus of scientific literature published in the materials domain.
We show that MatSciBERT outperforms SciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction.
We also discuss some of the applications of MatSciBERT in the materials domain for extracting information.
arXiv Detail & Related papers (2021-09-30T17:35:02Z)
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.