MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in
the Materials Science Domain
- URL: http://arxiv.org/abs/2310.15569v1
- Date: Tue, 24 Oct 2023 07:23:46 GMT
- Title: MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in
the Materials Science Domain
- Authors: Timo Pierre Schrader, Matteo Finco, Stefan Grünewald, Felix
Hildebrand, Annemarie Friedrich
- Abstract summary: We present MuLMS, a new dataset of 50 open-access articles, spanning seven sub-domains of materials science.
We present competitive neural models for all tasks and demonstrate that multi-task training with existing related resources leads to benefits.
- Score: 0.7947524927438001
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Keeping track of all relevant recent publications and experimental results
for a research area is a challenging task. Prior work has demonstrated the
efficacy of information extraction models in various scientific areas.
Recently, several datasets have been released for the yet understudied
materials science domain. However, these datasets focus on sub-problems such as
parsing synthesis procedures or on sub-domains, e.g., solid oxide fuel cells.
In this resource paper, we present MuLMS, a new dataset of 50 open-access
articles, spanning seven sub-domains of materials science. The corpus has been
annotated by domain experts with several layers ranging from named entities
over relations to frame structures. We present competitive neural models for
all tasks and demonstrate that multi-task training with existing related
resources leads to benefits.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
(2024-10-28T15:56:49Z)
- From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language.
Structured data is crucial for innovative and systematic materials design.
The advent of large language models (LLMs) represents a significant shift.
(2024-07-23T22:23:47Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
(2024-07-06T00:40:53Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
(2024-06-16T08:03:24Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
(2024-06-10T21:22:08Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
(2024-06-10T15:19:09Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
(2024-01-24T14:23:12Z)
- Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction [23.489721319567025]
We discuss, quantify, and document challenges in automated information extraction from materials science literature.
This information is spread in multiple formats, such as tables, text, and images, and with little or no uniformity in reporting style.
We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.
(2023-10-12T14:57:24Z)
- MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain [1.209268134212644]
Classifying the Argumentative Zone (AZ) has been proposed to improve processing of scholarly documents.
We present and release a new dataset of 50 manually annotated research articles.
(2023-07-05T14:55:18Z)
- PcMSP: A Dataset for Scientific Action Graphs Extraction from Polycrystalline Materials Synthesis Procedure Text [1.9573380763700712]
This dataset simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
A two-step human annotation and inter-annotator agreement study guarantee the high quality of the PcMSP corpus.
We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations.
(2022-10-22T09:43:54Z)
- The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain [11.085048329202335]
We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications.
A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition.
We present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set.
(2020-06-04T17:49:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.