PcMSP: A Dataset for Scientific Action Graphs Extraction from
Polycrystalline Materials Synthesis Procedure Text
- URL: http://arxiv.org/abs/2210.12401v1
- Date: Sat, 22 Oct 2022 09:43:54 GMT
- Authors: Xianjun Yang, Ya Zhuo, Julia Zuo, Xinlu Zhang, Stephen Wilson, Linda
Petzold
- Abstract summary: This dataset simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
A two-step human annotation and inter-annotator agreement study guarantee the high quality of the PcMSP corpus.
We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations.
- Score: 1.9573380763700712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting scientific action graphs from materials synthesis procedures is
important for reproducible research, machine automation, and material
prediction. However, the lack of annotated data has hindered progress in this field.
We demonstrate an effort to annotate Polycrystalline Materials Synthesis
Procedures (PcMSP) from 305 open access scientific articles for the
construction of synthesis action graphs. This is a new dataset for materials
science information extraction that simultaneously contains the synthesis
sentences extracted from the experimental paragraphs, as well as the entity
mentions and intra-sentence relations. A two-step human annotation and
inter-annotator agreement study guarantee the high quality of the PcMSP corpus.
We introduce four natural language processing tasks: sentence classification,
named entity recognition, relation classification, and joint extraction of
entities and relations. Comprehensive experiments validate the effectiveness of
several state-of-the-art models for these challenges while leaving substantial room
for improvement. We also perform an error analysis and point out some unique
challenges that require further investigation. We will release our annotation
scheme, the corpus, and code to the research community to alleviate the
scarcity of labeled data in this domain.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature.
Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- CARE: Extracting Experimental Findings From Clinical Literature [29.763929941107616]
This work presents CARE, a new IE dataset for the task of extracting clinical findings.
We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes.
We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports.
arXiv Detail & Related papers (2023-11-16T10:06:19Z)
- Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3 [52.59930033705221]
We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
arXiv Detail & Related papers (2023-04-26T22:21:33Z)
- BLIAM: Literature-based Data Synthesis for Synergistic Drug Combination Prediction [13.361489059744754]
BLIAM generates training data points that are interpretable and model-agnostic to downstream applications.
BLIAM can be further used to synthesize data points for novel drugs and cell lines that were not even measured in biomedical experiments.
arXiv Detail & Related papers (2023-02-14T06:48:52Z)
- Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms [5.366354612549173]
We focus on data-synthesis methods to create high-quality synthetic data.
We present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data.
Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
arXiv Detail & Related papers (2022-04-08T07:48:57Z)
- ULSA: Unified Language of Synthesis Actions for Representation of Synthesis Protocols [2.436060325115753]
We propose the first Unified Language of Synthesis Actions (ULSA) for describing synthesis procedures.
We created a dataset of 3,040 synthesis procedures annotated by domain experts according to the proposed ULSA scheme.
arXiv Detail & Related papers (2022-01-23T17:44:48Z)
- Extracting Fine-Grained Knowledge Graphs of Scientific Claims: Dataset and Transformer-Based Results [0.5710971447109948]
We build SciClaim, a dataset of scientific claims drawn from Social and Behavior Science (SBS), PubMed, and CORD-19 papers.
Our novel graph annotation schema incorporates not only coarse-grained entity spans as nodes and relations as edges between them, but also fine-grained attributes that modify entities and their relations.
By including more label types and more than twice the label density of previous datasets, SciClaim captures causal, comparative, predictive, statistical, and proportional associations over experimental variables along with their qualifications, subtypes, and evidence.
arXiv Detail & Related papers (2021-09-21T22:54:09Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.