TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of
Tasks Datasets and Metrics
- URL: http://arxiv.org/abs/2101.10273v1
- Date: Mon, 25 Jan 2021 17:54:06 GMT
- Title: TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of
Tasks Datasets and Metrics
- Authors: Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin and Debasis
Ganguly
- Abstract summary: We present a new corpus that contains domain expert annotations for Task (T), Dataset (D), Metric (M) entities on 2,000 sentences extracted from NLP papers.
We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL Anthology.
- Score: 32.4845534482475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tasks, Datasets and Evaluation Metrics are important concepts for
understanding experimental scientific papers. However, most previous work on
information extraction for scientific literature mainly focuses on the
abstracts only, and does not treat datasets as a separate type of entity (Zadeh
and Schumann, 2016; Luan et al., 2018). In this paper, we present a new corpus
that contains domain expert annotations for Task (T), Dataset (D), Metric (M)
entities on 2,000 sentences extracted from NLP papers. We report experiment
results on TDM extraction using a simple data augmentation strategy and apply
our tagger to around 30,000 NLP papers from the ACL Anthology. The corpus is
made publicly available to the community for fostering research on scientific
publication summarization (Erera et al., 2019) and knowledge discovery.
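As a concrete illustration of the annotation scheme, TDM tagging of the kind described above is commonly framed as BIO-style sequence labeling over sentence tokens. The sketch below is plain Python with a made-up sentence and hypothetical span offsets (not material from the TDMSci corpus, and not the authors' tagger); it only shows how Task, Dataset, and Metric span annotations can be converted into BIO tags suitable for training a standard sequence tagger.

# Illustrative only: map hypothetical TDM span annotations onto BIO tags.
# Sentence, spans, and label names are invented for demonstration.
from typing import List, Tuple

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start_token, end_token_exclusive, label) spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["We", "evaluate", "on", "SQuAD", "for", "question", "answering",
          "and", "report", "F1", "."]
# Hypothetical annotations: SQuAD as Dataset, "question answering" as Task, F1 as Metric.
spans = [(3, 4, "Dataset"), (5, 7, "Task"), (9, 10, "Metric")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)

A real corpus of this kind also needs conventions for nested or discontinuous mentions and abbreviations, which this toy conversion ignores.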
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
- TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation [22.738731393540633]
In domains where the source text is relatively long-form, such as scientific documents, a short summary cannot go beyond a general and coarse overview.
In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information.
arXiv Detail & Related papers (2022-06-02T02:45:31Z)
- LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of benchmark datasets for this task are from the scientific domain and contain only the document title and abstract.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
arXiv Detail & Related papers (2022-03-29T08:44:57Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles [3.0504782036247438]
We propose a topic-centric unsupervised multi-document summarization framework to generate abstractive summaries.
The proposed algorithm generates an abstractive summary by developing salient language unit selection and text generation techniques.
Our approach matches the state-of-the-art when evaluated on automated extractive evaluation metrics and performs better for abstractive summarization on five human evaluation metrics.
arXiv Detail & Related papers (2020-11-03T04:04:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.