MatSci-NLP: Evaluating Scientific Language Models on Materials Science
Language Tasks Using Text-to-Schema Modeling
- URL: http://arxiv.org/abs/2305.08264v1
- Date: Sun, 14 May 2023 22:01:24 GMT
- Title: MatSci-NLP: Evaluating Scientific Language Models on Materials Science
Language Tasks Using Text-to-Schema Modeling
- Authors: Yu Song, Santiago Miret, Bang Liu
- Abstract summary: MatSci-NLP is a benchmark for evaluating the performance of natural language processing (NLP) models on materials science text.
We construct the benchmark from publicly available materials science text data to encompass seven different NLP tasks.
We study various BERT-based models pretrained on different scientific text corpora on MatSci-NLP to understand the impact of pretraining strategies on understanding materials science text.
- Score: 13.30198968869312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MatSci-NLP, a natural language benchmark for evaluating the
performance of natural language processing (NLP) models on materials science
text. We construct the benchmark from publicly available materials science text
data to encompass seven different NLP tasks, including conventional NLP tasks
like named entity recognition and relation classification, as well as NLP tasks
specific to materials science, such as synthesis action retrieval which relates
to creating synthesis procedures for materials. We study various BERT-based
models pretrained on different scientific text corpora on MatSci-NLP to
understand the impact of pretraining strategies on understanding materials
science text. Given the scarcity of high-quality annotated data in the
materials science domain, we perform our fine-tuning experiments with limited
training data to encourage generalization across MatSci-NLP tasks. Our
experiments in this low-resource training setting show that language models
pretrained on scientific text outperform BERT trained on general text. MatBERT,
a model pretrained specifically on materials science journals, generally
performs best for most tasks. Moreover, we propose a unified text-to-schema
framework for multitask learning on MatSci-NLP and compare its performance with traditional
fine-tuning methods. In our analysis of different training methods, we find
that our proposed text-to-schema methods inspired by question-answering
consistently outperform single and multitask NLP fine-tuning methods. The code
and datasets are publicly available at
https://github.com/BangLab-UdeM-Mila/NLP4MatSci-ACL23.
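To make the text-to-schema idea concrete, the sketch below casts a named entity recognition example as schema-constrained text prediction, in the question-answering spirit described above. The schema layout, label set, and helper names are illustrative assumptions, not the benchmark's actual format; see the linked repository for the real implementation.

```python
import json

# Hypothetical schema for a materials-science NER task. The actual MatSci-NLP
# schemas may differ; this only illustrates constraining model output to a
# machine-parseable structure, as in the text-to-schema approach above.
NER_SCHEMA = {
    "task": "named_entity_recognition",
    "labels": ["material", "property", "synthesis_action", "other"],
}


def build_prompt(sentence: str, schema: dict) -> str:
    """Format the input as a question-answering-style prompt whose answer
    must follow the given schema."""
    return (
        f"Task: {schema['task']}\n"
        f"Allowed labels: {', '.join(schema['labels'])}\n"
        f"Sentence: {sentence}\n"
        'Answer as a JSON list of {"span": ..., "label": ...} objects:'
    )


def parse_answer(raw_answer: str, schema: dict) -> list:
    """Parse the model's text output back into structured predictions,
    dropping anything that does not conform to the schema."""
    try:
        items = json.loads(raw_answer)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [
        item
        for item in items
        if isinstance(item, dict)
        and set(item) == {"span", "label"}
        and item["label"] in schema["labels"]
    ]


if __name__ == "__main__":
    sentence = "LiFePO4 was annealed at 600 C to improve ionic conductivity."
    print(build_prompt(sentence, NER_SCHEMA))
    # A schema-conformant model answer would look like this:
    model_answer = (
        '[{"span": "LiFePO4", "label": "material"},'
        ' {"span": "ionic conductivity", "label": "property"}]'
    )
    print(parse_answer(model_answer, NER_SCHEMA))
```

Casting every task into one schema-shaped output space is what allows a single model to be fine-tuned jointly across all seven MatSci-NLP tasks in the multitask setting described above.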
Related papers
- From Tokens to Materials: Leveraging Language Models for Scientific Discovery [12.211984932142537]
This study investigates the application of language model embeddings to enhance material property prediction in materials science.
We demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties.
arXiv Detail & Related papers (2024-10-21T16:31:23Z)
- MatText: Do Language Models Need More than Text & Scale for Materials Modeling? [5.561723952524538]
MatText is a suite of benchmarking tools and datasets designed to systematically evaluate the performance of language models in modeling materials.
MatText provides essential tools for training and benchmarking the performance of language models in the context of materials science.
arXiv Detail & Related papers (2024-06-25T05:45:07Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP [0.5482532589225552]
In our work, we propose leveraging pretrained language models for training data acquisition.
We create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED.
arXiv Detail & Related papers (2022-08-30T18:42:55Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format (a formulation sketched in the example after this list).
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
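The text-to-text formulation above is the closest precursor of MatSci-NLP's text-to-schema modeling. As a rough illustration, the sketch below feeds a hypothetical materials relation-classification query to a generic t5-small checkpoint from HuggingFace; an off-the-shelf model will not produce meaningful labels without fine-tuning, and MatSci-NLP itself uses its own schemas and domain-pretrained encoders, so this shows only the input/output format.

```python
# A minimal sketch of the text-to-text idea, not MatSci-NLP's actual pipeline:
# the task, label options, and prompt wording below are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Cast a relation-classification query as plain text-in, text-out generation.
prompt = (
    "classify relation: LiFePO4 [SEP] olivine structure | "
    "options: has_structure, has_property, synthesized_by"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```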