SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP
- URL: http://arxiv.org/abs/2509.07801v3
- Date: Sat, 20 Sep 2025 02:06:27 GMT
- Title: SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP
- Authors: Decheng Duan, Yingyi Zhang, Jitong Peng, Chengzhi Zhang,
- Abstract summary: SciNLP is a benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain.<n>The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations.
- Score: 3.806421007129287
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP--a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at: https://github.com/AKADDC/SciNLP.
Related papers
- Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI [4.178382980763478]
We propose a novel hybrid text-mining framework to convert unstructured scientific text into structured data.<n>Our approach first transforms raw text into entity-recognized text, and subsequently into structured form.<n>We also enhance entity recognition performance by introducing an entity marker-a simple yet effective technique that uses symbolic annotations.
arXiv Detail & Related papers (2025-05-09T07:58:30Z) - Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek [2.3499129784547663]
We evaluate the performance of open-source (Llama-70b) and closed-source (GPT-4o mini) large language models on seven core NLP tasks with dataset availability.<n>Second, we expand the scope of Greek NLP by reframing Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training.<n>Third, we showcase a legal NLP case study, where a Summarize, Translate, and Embed (STE) methodology outperforms the traditional TF-IDF approach for clustering emphlong legal texts.
arXiv Detail & Related papers (2025-01-22T12:06:16Z) - SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs)
We show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.<n>This survey delves into an important attribute of these datasets: the dialect of a language.<n>Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z) - Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions.
The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities.
We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z) - Surveying the Landscape of Text Summarization with Deep Learning: A
Comprehensive Review [2.4185510826808487]
Deep learning has revolutionized natural language processing (NLP) by enabling the development of models that can learn complex representations of language data.
Deep learning models for NLP typically use large amounts of data to train deep neural networks, allowing them to learn the patterns and relationships in language data.
Applying deep learning to text summarization refers to the use of deep neural networks to perform text summarization tasks.
arXiv Detail & Related papers (2023-10-13T21:24:37Z) - Modeling Multi-Granularity Hierarchical Features for Relation Extraction [26.852869800344813]
We propose a novel method to extract multi-granularity features based solely on the original input sentences.
We show that effective structured features can be attained even without external knowledge.
arXiv Detail & Related papers (2022-04-09T09:44:05Z) - WANLI: Worker and AI Collaboration for Natural Language Inference
Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
arXiv Detail & Related papers (2022-01-16T03:13:49Z) - DocNLI: A Large-scale Dataset for Document-level Natural Language
Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z) - An Empirical Survey of Data Augmentation for Limited Data Learning in
NLP [88.65488361532158]
dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks.
Data augmentation methods have been explored as a means of improving data efficiency in NLP.
We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.