Dataset Mention Extraction in Scientific Articles Using Bi-LSTM-CRF Model
- URL: http://arxiv.org/abs/2405.13135v1
- Date: Tue, 21 May 2024 18:12:37 GMT
- Title: Dataset Mention Extraction in Scientific Articles Using Bi-LSTM-CRF Model
- Authors: Tong Zeng, Daniel Acuna
- Abstract summary: We show that citing datasets is not a common or standard practice in spite of recent efforts by data repositories and funding agencies.
A potential solution to this problem is to automatically extract dataset mentions from scientific articles.
In this work, we propose to achieve such extraction by using a neural network based on a Bi-LSTM-CRF architecture.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Datasets are critical for scientific research, playing an important role in replication, reproducibility, and efficiency. Researchers have recently shown that datasets are becoming more important for science to function properly, even serving as artifacts of study themselves. However, citing datasets is not a common or standard practice in spite of recent efforts by data repositories and funding agencies. This greatly affects our ability to track their usage and importance. A potential solution to this problem is to automatically extract dataset mentions from scientific articles. In this work, we propose to achieve such extraction by using a neural network based on a Bi-LSTM-CRF architecture. Our method achieves F1 = 0.885 in social science articles released as part of the Rich Context Dataset. We discuss the limitations of the current datasets and propose modifications to the model to be done in the future.
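The model family named in the abstract is a standard neural sequence tagger: a bidirectional LSTM encodes each token with its left and right context, and a CRF layer on top scores entire tag sequences, so decoding can enforce constraints such as I-DATASET following only B-DATASET or I-DATASET in a BIO scheme. The abstract ships no code, so the following is only a minimal PyTorch sketch of that architecture; the hyperparameters, the three-tag BIO set, and the third-party pytorch-crf package are illustrative assumptions, not the authors' implementation.

```python
# Minimal Bi-LSTM-CRF tagger sketch (assumed details, not the paper's code).
# Requires the third-party pytorch-crf package: pip install pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF


class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM: each token is encoded with left + right context.
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        # Linear layer produces per-token emission scores over the tag set.
        self.to_tags = nn.Linear(hidden_dim, num_tags)
        # CRF layer learns tag-transition scores and decodes globally.
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, token_ids):
        out, _ = self.lstm(self.embedding(token_ids))
        return self.to_tags(out)

    def loss(self, token_ids, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(token_ids), tags,
                         mask=mask, reduction='mean')

    def predict(self, token_ids, mask):
        # Viterbi decoding returns the highest-scoring tag sequence.
        return self.crf.decode(self._emissions(token_ids), mask=mask)


# Toy usage with a BIO tag set {0: O, 1: B-DATASET, 2: I-DATASET}.
model = BiLSTMCRF(vocab_size=5000, num_tags=3)
tokens = torch.randint(1, 5000, (2, 12))  # batch of 2 padded sentences
gold = torch.randint(0, 3, (2, 12))       # random gold tags (demo only)
mask = torch.ones(2, 12, dtype=torch.bool)
print(model.loss(tokens, gold, mask).item(), model.predict(tokens, mask))
```

Training against the CRF log-likelihood rather than per-token cross-entropy is what lets the model treat a dataset mention as a contiguous span instead of a bag of independent token decisions.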
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Why Tabular Foundation Models Should Be a Research Priority [65.75744962286538]
Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly lags behind in terms of scale and power.
We believe the time is now to start developing tabular foundation models, or what we coin a Large Tabular Model (LTM).
arXiv Detail & Related papers (2024-05-02T10:05:16Z)
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows [72.40917624485822]
We introduce DataDreamer, an open source Python library that allows researchers to implement powerful large language model workflows.
DataDreamer also helps researchers adhere to best practices that we propose to encourage open science.
arXiv Detail & Related papers (2024-02-16T00:10:26Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers [19.1408856831043]
We investigate the impact of different datasets on model performance for the crowd-annotated CODA-19 research aspect classification task.
Our results indicate that using the PubMed 200K RCT dataset does not improve performance for the CODA-19 task.
arXiv Detail & Related papers (2023-06-07T22:56:53Z)
- Bridging the Gap: Enhancing the Utility of Synthetic Data via Post-Processing Techniques [7.967995669387532]
Generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap with real-accuracy scores to an error of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z)
- Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [9.33544942080883]
This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science.
We accomplished this task by tuning GPT-3 on an existing perovskite solar cell FAIR dataset, reaching a 91.8% F1-score, and extended the dataset with data published since its release.
We also designed experiments to predict the electrical performance of solar cells and to design materials or devices with targeted parameters using large language models (LLMs).
arXiv Detail & Related papers (2023-04-05T04:01:52Z)
- Position Paper on Dataset Engineering to Accelerate Science [1.952708415083428]
In this work, we will use the token "dataset" to designate a structured set of data built to perform a well-defined task.
Specifically, in science, each area has unique forms to organize, gather and handle its datasets.
We advocate that science and engineering discovery processes are extreme instances of the need for such organization on datasets.
arXiv Detail & Related papers (2023-03-09T19:07:40Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest that future dataset creation include a simple model as a difficulty/bias probe, and that future model development use a clean, non-overlapping site and date split, as sketched below.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
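The clean-split recommendation above is concrete enough to sketch. Below is a minimal illustration, assuming each article record carries hypothetical "source" and "date" fields; the cutoff date is an arbitrary placeholder:

```python
# Sketch of a "clean" evaluation split: test articles come strictly after the
# cutoff and share no publication source with the training set. The field
# names ("source", "date") and the cutoff are hypothetical placeholders.
from datetime import date


def clean_split(articles, cutoff=date(2020, 1, 1)):
    train = [a for a in articles if a["date"] < cutoff]
    train_sources = {a["source"] for a in train}
    test = [a for a in articles
            if a["date"] >= cutoff and a["source"] not in train_sources]
    return train, test
```

Because no site contributes to both halves, a model can no longer score well by memorizing source-specific artifacts, which is exactly the failure mode the paper reports.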