Overview of STEM Science as Process, Method, Material, and Data Named Entities
- URL: http://arxiv.org/abs/2205.11863v1
- Date: Tue, 24 May 2022 07:35:24 GMT
- Title: Overview of STEM Science as Process, Method, Material, and Data Named Entities
- Authors: Jennifer D'Souza
- Abstract summary: We develop and analyze a large-scale structured dataset of STEM articles across 10 different disciplines.
Our analysis is defined over a large-scale corpus comprising 60K abstracts structured as four scientific entities: process, method, material, and data.
The STEM-NER-60k corpus, created in this work, comprises over 1M extracted entities from 60k STEM articles obtained from a major publishing platform.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We are faced with an unprecedented production of scholarly publications
worldwide. Stakeholders in digital libraries posit that the document-based
publishing paradigm has reached the limits of adequacy. Instead, structured,
machine-interpretable, fine-grained scholarly knowledge publishing as Knowledge
Graphs (KG) is strongly advocated. In this work, we develop and analyze a
large-scale structured dataset of STEM articles across 10 different
disciplines, viz. Agriculture, Astronomy, Biology, Chemistry, Computer Science,
Earth Science, Engineering, Material Science, Mathematics, and Medicine. Our
analysis is defined over a large-scale corpus comprising 60K abstracts
structured as four scientific entities: process, method, material, and data.
Thus our study presents, for the first time, an analysis of a large-scale
multidisciplinary corpus under the construct of four named entity labels that
are specifically defined and selected to be domain-independent as opposed to
domain-specific. The work is then inadvertently a feasibility test of
characterizing multidisciplinary science with domain-independent concepts.
Further, to summarize the distinct facets of scientific knowledge per concept
per discipline, a set of word cloud visualizations is offered. The
STEM-NER-60k corpus, created in this work, comprises over 1M extracted entities
from 60k STEM articles obtained from a major publishing platform and is
publicly released at https://github.com/jd-coderepos/stem-ner-60k.
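The per-concept, per-discipline summaries mentioned in the abstract rest on term frequencies, which a word cloud then renders visually. The following is a minimal sketch of that counting step only, not the author's pipeline. The record shape (discipline, label, entity text), the toy data, and the helper names `term_frequencies` and `top_terms` are all assumptions made for illustration; the actual file layout of the STEM-NER-60k release is not described here.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (discipline, entity label, entity text).
# This shape is an assumption; it is not the corpus's documented format.
RECORDS = [
    ("Chemistry", "material", "ethanol"),
    ("Chemistry", "material", "Ethanol"),
    ("Chemistry", "process", "oxidation"),
    ("Medicine", "method", "randomized trial"),
    ("Medicine", "data", "survival rate"),
    ("Medicine", "method", "randomized trial"),
]

# The four domain-independent labels named in the abstract.
LABELS = ("process", "method", "material", "data")


def term_frequencies(records):
    """Count case-folded entity mentions per (discipline, label) pair."""
    counts = defaultdict(Counter)
    for discipline, label, text in records:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        counts[(discipline, label)][text.lower()] += 1
    return counts


def top_terms(counts, discipline, label, k=3):
    """Return up to k most frequent terms for one discipline and label."""
    return counts[(discipline, label)].most_common(k)


freqs = term_frequencies(RECORDS)
print(top_terms(freqs, "Chemistry", "material"))  # [('ethanol', 2)]
```

The resulting frequency tables are the kind of input a word cloud renderer typically consumes; the rendering itself is left out to keep the sketch free of third-party dependencies.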
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- Ontology Embedding: A Survey of Methods, Applications and Resources [54.3453925775069]
Ontologies are widely used for representing domain knowledge and meta data.
One straightforward solution is to integrate statistical analysis and machine learning.
Numerous papers have been published on embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field.
arXiv Detail & Related papers (2024-06-16T14:49:19Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Cyber-Security Knowledge Graph Generation by Hierarchical Nonnegative Matrix Factorization [8.158794536515245]
Much of human knowledge in cybersecurity is encapsulated within the ever-growing volume of scientific papers.
Knowledge Graphs (KGs) serve as a means to store factual information in a structured manner.
One of the challenges in constructing a KG from scientific literature is the extraction of ontology from unstructured text.
arXiv Detail & Related papers (2024-03-24T16:30:05Z)
- Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System [47.13932723910289]
We introduce an open-source multi-modal automated academic paper interpretation system (MMAPIS) with three-step process stages.
It employs the hybrid modality preprocessing and alignment module to extract plain text, and tables or figures from documents separately.
It then aligns this information based on the section names they belong to, ensuring that data with identical section names are categorized under the same section.
It utilizes the extracted section names to divide the article into shorter text segments, facilitating specific summarizations both within and between sections via LLMs.
arXiv Detail & Related papers (2024-01-17T11:50:53Z)
- MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain [0.7947524927438001]
We present MuLMS, a new dataset of 50 open-access articles, spanning seven sub-domains of materials science.
We present competitive neural models for all tasks and demonstrate that multi-task training with existing related resources leads to benefits.
arXiv Detail & Related papers (2023-10-24T07:23:46Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- SMAuC -- The Scientific Multi-Authorship Corpus [32.77279821297011]
We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorship analysis.
Comprising over 3 million publications across various disciplines from over 5 million authors, SMAuC is the largest openly accessible corpus for this purpose.
arXiv Detail & Related papers (2022-11-04T14:07:17Z)
- The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources [8.54082916181163]
The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks.
It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform.
arXiv Detail & Related papers (2020-03-02T16:35:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.