SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents
- URL: http://arxiv.org/abs/2410.21155v1
- Date: Mon, 28 Oct 2024 15:56:49 GMT
- Title: SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents
- Authors: Qi Zhang, Zhijia Chen, Huitong Pan, Cornelia Caragea, Longin Jan Latecki, Eduard Dragut
- Abstract summary: We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
- Score: 49.54155332262579
- Abstract: Scientific information extraction (SciIE) is critical for converting unstructured knowledge from scholarly articles into structured data (entities and relations). Several datasets have been proposed for training and validating SciIE models. However, due to the high complexity and cost of annotating scientific texts, those datasets restrict their annotations to specific parts of papers, such as abstracts, resulting in the loss of diverse entity mentions and relations in context. In this paper, we release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations. To capture the intricate use and interactions among entities in full texts, our dataset contains a fine-grained tag set for relations. Additionally, we provide an out-of-distribution test set to offer a more realistic evaluation. We conduct comprehensive experiments, including state-of-the-art supervised models and our proposed LLM-based baselines, and highlight the challenges presented by our dataset, encouraging the development of innovative models to further the field of SciIE.
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
- CARE: Extracting Experimental Findings From Clinical Literature [29.763929941107616]
This work presents CARE, a new IE dataset for the task of extracting clinical findings.
We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes.
We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports.
arXiv Detail & Related papers (2023-11-16T10:06:19Z)
- DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based Queries [2.4816250611120547]
We propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE).
For each complex query, we assembled a collection of 100 relevant documents and produced annotated relevance scores for ranking them.
Anno-GPT is a framework for validating the performance of Large Language Models (LLMs) on expert-level dataset annotation tasks.
arXiv Detail & Related papers (2023-10-07T03:25:06Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics [32.4845534482475]
We present a new corpus that contains domain expert annotations for Task (T), Dataset (D), and Metric (M) entities on 2,000 sentences extracted from NLP papers.
We report experiment results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL.
arXiv Detail & Related papers (2021-01-25T17:54:06Z)
- Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality.
We also create synthetic datasets which are more natural, resembling real world document-summary pairs.
Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z)
- Method and Dataset Entity Mining in Scientific Literature: A CNN + Bi-LSTM Model with Self-attention [21.93889297841459]
We propose a novel entity recognition model, called MDER, which is able to effectively extract the method and dataset entities from scientific papers.
We evaluate the proposed model on datasets constructed from the published papers of four research areas in computer science, i.e., NLP, CV, Data Mining and AI.
arXiv Detail & Related papers (2020-10-26T13:38:43Z)
- CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis [33.190021245507445]
Large scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process.
We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments.
We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines.
arXiv Detail & Related papers (2020-08-28T19:57:49Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)