TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of
Tasks Datasets and Metrics
- URL: http://arxiv.org/abs/2101.10273v1
- Date: Mon, 25 Jan 2021 17:54:06 GMT
- Title: TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of
Tasks Datasets and Metrics
- Authors: Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin and Debasis
Ganguly
- Abstract summary: We present a new corpus that contains domain expert annotations for Task (T), Dataset (D), Metric (M) entities on 2,000 sentences extracted from NLP papers.
We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL Anthology.
- Score: 32.4845534482475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tasks, Datasets and Evaluation Metrics are important concepts for
understanding experimental scientific papers. However, most previous work on
information extraction for scientific literature mainly focuses on the
abstracts only, and does not treat datasets as a separate type of entity (Zadeh
and Schumann, 2016; Luan et al., 2018). In this paper, we present a new corpus
that contains domain expert annotations for Task (T), Dataset (D), Metric (M)
entities on 2,000 sentences extracted from NLP papers. We report experiment
results on TDM extraction using a simple data augmentation strategy and apply
our tagger to around 30,000 NLP papers from the ACL Anthology. The corpus is
made publicly available to the community for fostering research on scientific
publication summarization (Erera et al., 2019) and knowledge discovery.
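As a concrete illustration of the annotation scheme, TDM tagging of the kind described above is commonly framed as BIO-style sequence labeling over sentence tokens. The sketch below is plain Python with a made-up sentence and hypothetical span offsets (not material from the TDMSci corpus, and not the authors' tagger); it only shows how Task, Dataset, and Metric span annotations can be converted into BIO tags suitable for training a standard sequence tagger.

# Illustrative only: map hypothetical TDM span annotations onto BIO tags.
# Sentence, spans, and label names are invented for demonstration.
from typing import List, Tuple

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start_token, end_token_exclusive, label) spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["We", "evaluate", "on", "SQuAD", "for", "question", "answering",
          "and", "report", "F1", "."]
# Hypothetical annotations: SQuAD as Dataset, "question answering" as Task, F1 as Metric.
spans = [(3, 4, "Dataset"), (5, 7, "Task"), (9, 10, "Metric")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)

A real corpus of this kind also needs conventions for nested or discontinuous mentions and abbreviations, which this toy conversion ignores.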
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
- TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation [22.738731393540633]
In domains where the source text is relatively long-form, such as scientific documents, a short summary cannot go beyond a general and coarse overview.
In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information.
arXiv Detail & Related papers (2022-06-02T02:45:31Z)
- LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of benchmark datasets for this task are from the scientific domain and contain only the document title and abstract.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
arXiv Detail & Related papers (2022-03-29T08:44:57Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles [3.0504782036247438]
We propose a topic-centric unsupervised multi-document summarization framework to generate abstractive summaries.
The proposed algorithm generates an abstractive summary by developing salient language unit selection and text generation techniques.
Our approach matches the state-of-the-art when evaluated on automated extractive evaluation metrics and performs better for abstractive summarization on five human evaluation metrics.
arXiv Detail & Related papers (2020-11-03T04:04:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.