The STEM-ECR Dataset: Grounding Scientific Entity References in STEM
Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- URL: http://arxiv.org/abs/2003.01006v4
- Date: Tue, 28 Jul 2020 09:45:52 GMT
- Title: The STEM-ECR Dataset: Grounding Scientific Entity References in STEM
Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- Authors: Jennifer D'Souza, Anett Hoppe, Arthur Brack, Mohamad Yaser Jaradeh,
Sören Auer, Ralph Ewerth
- Abstract summary: The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks.
It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform.
- Score: 8.54082916181163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the STEM (Science, Technology, Engineering, and Medicine)
Dataset for Scientific Entity Extraction, Classification, and Resolution,
version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to
provide a benchmark for the evaluation of scientific entity extraction,
classification, and resolution tasks in a domain-independent fashion. It
comprises abstracts in 10 STEM disciplines that were found to be the most
prolific ones on a major publishing platform. We describe the creation of such
a multidisciplinary corpus and highlight the obtained findings in terms of the
following features: 1) a generic conceptual formalism for scientific entities
in a multidisciplinary scientific context; 2) the feasibility of the
domain-independent human annotation of scientific entities under such a generic
formalism; 3) a performance benchmark obtainable for automatic extraction of
multidisciplinary scientific entities using BERT-based neural models; 4) a
delineated 3-step entity resolution procedure for human annotation of the
scientific entities via encyclopedic entity linking and lexicographic word
sense disambiguation; and 5) human evaluations of Babelfy returned encyclopedic
links and lexicographic senses for our entities. Our findings cumulatively
indicate that human annotation and automatic learning of multidisciplinary
scientific concepts, as well as their semantic disambiguation, are feasible in
a setting as wide-ranging as STEM.
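To make point 3) of the abstract concrete, the sketch below shows the kind of BERT-based token classification such an extraction benchmark typically relies on. This is an illustrative assumption, not the authors' released code: the base checkpoint, the BIO encoding of the paper's four generic entity classes (Process, Method, Material, Data), and the example sentence are all placeholders, and a checkpoint actually fine-tuned on STEM-ECR would need to be substituted for meaningful output.

```python
# Minimal sketch: BERT-based extraction of STEM-ECR-style entities as token classification.
# Assumptions: BIO tags over the four generic classes; "bert-base-cased" stands in for a
# checkpoint fine-tuned on STEM-ECR (without fine-tuning the predictions are meaningless).
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

LABELS = ["O",
          "B-PROCESS", "I-PROCESS",
          "B-METHOD", "I-METHOD",
          "B-MATERIAL", "I-MATERIAL",
          "B-DATA", "I-DATA"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",                      # swap in a model fine-tuned on STEM-ECR v1.0
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

abstract = ("We measure the thermal conductivity of graphene membranes "
            "using Raman spectroscopy.")
for entity in ner(abstract):
    # Each aggregated prediction carries the class, the surface mention, and a confidence.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```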
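Point 5) evaluates encyclopedic links and lexicographic senses returned by Babelfy. A small sketch of querying Babelfy's public HTTP API for an extracted mention follows; the endpoint and response field names are taken from the public API documentation and should be treated as assumptions, the API key and example text are placeholders, and this is not the authors' evaluation pipeline.

```python
# Minimal sketch: disambiguating an extracted scientific entity mention with Babelfy.
# Assumptions: the https://babelfy.io/v1/disambiguate endpoint and the response fields
# "charFragment", "babelSynsetID", and "DBpediaURL" as described in the public docs.
import requests

BABELFY_ENDPOINT = "https://babelfy.io/v1/disambiguate"

def disambiguate(text: str, key: str, lang: str = "EN"):
    """Return Babelfy annotations (BabelNet synsets plus DBpedia links) for `text`."""
    params = {"text": text, "lang": lang, "key": key}
    response = requests.get(BABELFY_ENDPOINT, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for annotation in disambiguate("thermal conductivity of graphene", key="YOUR_KEY"):
        fragment = annotation["charFragment"]           # character offsets of the mention
        print(annotation["babelSynsetID"],               # lexicographic sense (BabelNet)
              annotation.get("DBpediaURL", ""),          # encyclopedic link (DBpedia/Wikipedia)
              (fragment["start"], fragment["end"]))
```

In terms of the paper's two resolution targets, the DBpedia URL plays the role of the encyclopedic entity link, while the BabelNet synset ID stands in for the lexicographic word sense.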
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly labelled mention annotations given as in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z)
- ATEM: A Topic Evolution Model for the Detection of Emerging Topics in Scientific Archives [1.854328133293073]
ATEM is based on dynamic topic modeling and dynamic graph embedding techniques.
ATEM can efficiently detect emerging cross-disciplinary topics within the DBLP archive of over five million computer science articles.
arXiv Detail & Related papers (2023-06-04T00:32:45Z)
- MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
arXiv Detail & Related papers (2023-05-07T03:29:55Z)
- SciTweets -- A Dataset and Annotation Framework for Detecting Scientific Online Discourse [2.3371548697609303]
Scientific topics, claims and resources are increasingly debated as part of online discourse.
This has led to both significant societal impact and increased interest in scientific online discourse from various disciplines.
Research across disciplines currently suffers from a lack of robust definitions of the various forms of science-relatedness.
arXiv Detail & Related papers (2022-06-15T08:14:55Z)
- An Informational Space Based Semantic Analysis for Scientific Texts [62.997667081978825]
This paper introduces computational methods for semantic analysis and for quantifying the meaning of short scientific texts.
The representation of science-specific meaning is standardised by replacing situation representations rather than psychological properties.
The work lays the groundwork for a geometric representation of the meaning of texts.
arXiv Detail & Related papers (2022-05-31T11:19:32Z)
- Overview of STEM Science as Process, Method, Material, and Data Named Entities [0.0]
We develop and analyze a large-scale structured dataset of STEM articles across 10 different disciplines.
Our analysis is defined over a large-scale corpus comprising 60K abstracts structured as four scientific entities process, method, material, and data.
The STEM-NER-60k corpus, created in this work, comprises over 1M extracted entities from 60k STEM articles obtained from a major publishing platform.
arXiv Detail & Related papers (2022-05-24T07:35:24Z)
- Expressing High-Level Scientific Claims with Formal Semantics [0.8258451067861932]
We analyze the main claims from a sample of scientific articles from all disciplines.
We find that their semantics are more complex than what a straightforward application of formalisms like RDF or OWL accounts for.
We show how instantiating the five slots of the proposed super-pattern leads to a strictly defined statement in higher-order logic.
arXiv Detail & Related papers (2021-09-27T09:52:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.