The SOFC-Exp Corpus and Neural Approaches to Information Extraction in
the Materials Science Domain
- URL: http://arxiv.org/abs/2006.03039v1
- Date: Thu, 4 Jun 2020 17:49:34 GMT
- Title: The SOFC-Exp Corpus and Neural Approaches to Information Extraction in
the Materials Science Domain
- Authors: Annemarie Friedrich and Heike Adel and Federico Tomazic and Johannes
Hingerl and Renou Benteau and Anika Maruscyk and Lukas Lange
- Abstract summary: We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications.
A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks.
We present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set.
- Score: 11.085048329202335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new challenging information extraction task in the
domain of materials science. We develop an annotation scheme for marking
information on experiments related to solid oxide fuel cells in scientific
publications, such as involved materials and measurement conditions. With this
paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus
consisting of 45 open-access scholarly articles annotated by domain experts. A
corpus and an inter-annotator agreement study demonstrate the complexity of the
suggested named entity recognition and slot filling tasks as well as high
annotation quality. We also present strong neural-network based models for a
variety of tasks that can be addressed on the basis of our new data set. On all
tasks, using BERT embeddings leads to large performance gains, but with
increasing task complexity, adding a recurrent neural network on top seems
beneficial. Our models will serve as competitive baselines in future work, and
analysis of their performance highlights difficult cases when modeling the data
and suggests promising research directions.
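
The abstract's modeling note (BERT embeddings, with a recurrent layer on top becoming helpful as task complexity grows) can be illustrated with a minimal token-tagging sketch. This is a hedged sketch, not the authors' released models; the pretrained checkpoint, tag inventory, and layer sizes below are illustrative assumptions.

```python
# Minimal sketch (not the SOFC-Exp reference implementation): a BERT encoder
# with an optional BiLSTM on top, feeding a per-token classifier for
# NER / slot-filling tags. Checkpoint, tag set, and sizes are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

TAGS = ["O", "B-MATERIAL", "I-MATERIAL", "B-VALUE", "I-VALUE", "B-DEVICE", "I-DEVICE"]

class BertTagger(nn.Module):
    def __init__(self, model_name="bert-base-cased", use_rnn=True, rnn_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        enc_dim = self.encoder.config.hidden_size
        # Optional recurrent layer; the paper reports this helps on the harder tasks.
        self.rnn = nn.LSTM(enc_dim, rnn_hidden, batch_first=True, bidirectional=True) if use_rnn else None
        self.classifier = nn.Linear(2 * rnn_hidden if use_rnn else enc_dim, len(TAGS))

    def forward(self, input_ids, attention_mask):
        # Contextual subword embeddings from BERT.
        states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        if self.rnn is not None:
            states, _ = self.rnn(states)
        return self.classifier(states)  # (batch, seq_len, num_tags) logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = BertTagger()
batch = tokenizer(["The anode was made of Ni-YSZ and tested at 800 degrees."], return_tensors="pt")
with torch.no_grad():
    logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # e.g. torch.Size([1, seq_len, 7])
```

A CRF layer or a span-based decoder could replace the plain per-token classifier; the sketch only mirrors the BERT-versus-BERT-plus-RNN contrast described in the abstract.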
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Artificial Intuition: Efficient Classification of Scientific Abstracts [42.299140272218274]
Short scientific texts efficiently transmit dense information to experts possessing a rich body of knowledge to aid interpretation.
To address this gap, we have developed a novel approach to generate and appropriately assign coarse domain-specific labels.
We show that a Large Language Model (LLM) can provide metadata essential to the task, in a process akin to the augmentation of supplemental knowledge.
arXiv Detail & Related papers (2024-07-08T16:34:47Z)
- Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
- Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z)
- CARE: Extracting Experimental Findings From Clinical Literature [29.763929941107616]
This work presents CARE, a new IE dataset for the task of extracting clinical findings.
We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes.
We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports.
arXiv Detail & Related papers (2023-11-16T10:06:19Z)
- All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
- MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain [0.7947524927438001]
We present MuLMS, a new dataset of 50 open-access articles, spanning seven sub-domains of materials science.
We present competitive neural models for all tasks and demonstrate that multi-task training with existing related resources leads to benefits.
arXiv Detail & Related papers (2023-10-24T07:23:46Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge with explicitly syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
arXiv Detail & Related papers (2022-01-13T08:25:53Z)
- Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality.
We also create synthetic datasets which are more natural, resembling real world document-summary pairs.
Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z)
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
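
The KILT entry above notes that a shared dense vector index coupled with a seq2seq model is a strong baseline. A minimal retrieve-then-generate sketch of that pattern follows; the encoder and reader checkpoints, the toy in-memory index, and the prompt format are illustrative assumptions, not the KILT reference implementation.

```python
# Minimal retrieve-then-generate sketch (illustrative, not the KILT baseline code):
# a dense index over passages plus a seq2seq reader. Model names are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM

enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pooled contextual embeddings as a stand-in for a trained dense retriever.
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        states = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (states * mask).sum(1) / mask.sum(1)

passages = [
    "Solid oxide fuel cells operate at high temperatures.",
    "KILT grounds several knowledge-intensive tasks in one Wikipedia snapshot.",
]
index = embed(passages)                        # toy in-memory "index"

query = "What do the tasks in KILT share?"
scores = embed([query]) @ index.T              # inner-product retrieval
best = passages[int(scores.argmax())]

gen_tok = AutoTokenizer.from_pretrained("t5-small")
reader = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
prompt = f"question: {query} context: {best}"
out = reader.generate(**gen_tok(prompt, return_tensors="pt"), max_new_tokens=32)
print(gen_tok.decode(out[0], skip_special_tokens=True))
```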