MatSciBERT: A Materials Domain Language Model for Text Mining and
Information Extraction
- URL: http://arxiv.org/abs/2109.15290v1
- Date: Thu, 30 Sep 2021 17:35:02 GMT
- Authors: Tanishq Gupta, Mohd Zaki, N. M. Anoop Krishnan, Mausam
- Abstract summary: MatSciBERT is a language model trained on a large corpus of scientific literature published in the materials domain.
We show that MatSciBERT outperforms SciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction.
We also discuss some of the applications of MatSciBERT in the materials domain for extracting information.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: An overwhelmingly large amount of knowledge in the materials domain is
generated and stored as text published in peer-reviewed scientific literature.
Recent developments in natural language processing, such as bidirectional
encoder representations from transformers (BERT) models, provide promising
tools to extract information from these texts. However, direct application of
these models in the materials domain may yield suboptimal results as the models
themselves may not be trained on notations and jargon that are specific to the
domain. Here, we present a materials-aware language model, namely, MatSciBERT,
which is trained on a large corpus of scientific literature published in the
materials domain. We further evaluate the performance of MatSciBERT on three
downstream tasks, namely, abstract classification, named entity recognition,
and relation extraction, on different materials datasets. We show that
MatSciBERT outperforms SciBERT, a language model trained on a scientific corpus, on
all the tasks. Further, we discuss some of the applications of MatSciBERT in
the materials domain for extracting information, which can, in turn, contribute
to materials discovery or optimization. Finally, to make the work accessible to
the larger materials community, we make the pretrained and finetuned weights
and the models of MatSciBERT freely accessible.
Related papers
- From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language.
Structured data is crucial for innovative and systematic materials design.
The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z)
- MatText: Do Language Models Need More than Text & Scale for Materials Modeling? [5.561723952524538]
MatText is a suite of benchmarking tools and datasets designed to systematically evaluate the performance of language models in modeling materials.
MatText provides essential tools for training and benchmarking the performance of language models in the context of materials science.
arXiv Detail & Related papers (2024-06-25T05:45:07Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science [36.44466740289109]
We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct).
We then apply it to finetune a LLaMa-based language model targeted for materials science (HoneyBee).
arXiv Detail & Related papers (2023-10-12T17:06:19Z)
- Adapting Large Language Models to Domains via Reading Comprehension [86.24451681746676]
We explore how continued pre-training on domain-specific corpora influences large language models.
We show that training on the raw corpora endows the model with domain knowledge, but drastically hurts its ability for question answering.
We propose a simple method for transforming raw corpora into reading comprehension texts.
arXiv Detail & Related papers (2023-09-18T07:17:52Z)
- Materials Informatics Transformer: A Language Model for Interpretable Materials Properties Prediction [6.349503549199403]
We introduce our model Materials Informatics Transformer (MatInFormer) for material property prediction.
Specifically, we introduce a novel approach that involves learning the grammar of crystallography through the tokenization of pertinent space group information.
arXiv Detail & Related papers (2023-08-30T18:34:55Z)
- MatSci-NLP: Evaluating Scientific Language Models on Materials Science Language Tasks Using Text-to-Schema Modeling [13.30198968869312]
MatSci-NLP is a benchmark for evaluating the performance of natural language processing (NLP) models on materials science text.
We construct the benchmark from publicly available materials science text data to encompass seven different NLP tasks.
We study various BERT-based models pretrained on different scientific text corpora on MatSci-NLP to understand the impact of pretraining strategies on understanding materials science text.
arXiv Detail & Related papers (2023-05-14T22:01:24Z)
- Sparse*BERT: Sparse Models Generalize To New tasks and Domains [79.42527716035879]
This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks.
We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text.
arXiv Detail & Related papers (2022-05-25T02:51:12Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)