SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions
- URL: http://arxiv.org/abs/2406.14756v1
- Date: Thu, 20 Jun 2024 22:03:21 GMT
- Title: SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions
- Authors: Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki,
- Abstract summary: We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
- Score: 52.35520385083425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus's scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus's utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - Can Large Language Models Detect Misinformation in Scientific News
Reporting? [1.0344642971058586]
This paper investigates whether it is possible to use large language models (LLMs) to detect misinformation in scientific reporting.
We first present a new labeled dataset SciNews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources.
We identify dimensions of scientific validity in science news articles and explore how this can be integrated into the automated detection of scientific misinformation.
arXiv Detail & Related papers (2024-02-22T04:07:00Z) - SMAuC -- The Scientific Multi-Authorship Corpus [32.77279821297011]
We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorship analysis.
Comprising over 3 million publications across various disciplines from over 5 million authors, SMAuC is the largest openly accessible corpus for this purpose.
arXiv Detail & Related papers (2022-11-04T14:07:17Z) - Modeling Information Change in Science Communication with Semantically
Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change.
SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.
Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
arXiv Detail & Related papers (2022-10-24T07:44:38Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of
Tasks Datasets and Metrics [32.4845534482475]
We present a new corpus that contains domain expert annotations for Task (T), dataset (D), Metric (M) entities on 2,000 sentences extracted from NLP papers.
We report experiment results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL.
arXiv Detail & Related papers (2021-01-25T17:54:06Z) - Fact or Fiction: Verifying Scientific Claims [53.29101835904273]
We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim.
We construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales.
We show that our system is able to verify claims related to COVID-19 by identifying evidence from the CORD-19 corpus.
arXiv Detail & Related papers (2020-04-30T17:22:57Z) - The STEM-ECR Dataset: Grounding Scientific Entity References in STEM
Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources [8.54082916181163]
The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks.
It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform.
arXiv Detail & Related papers (2020-03-02T16:35:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.