GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity
Extraction Focused on Machine Learning Models and Datasets
- URL: http://arxiv.org/abs/2311.09860v1
- Date: Thu, 16 Nov 2023 12:43:02 GMT
- Title: GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity
Extraction Focused on Machine Learning Models and Datasets
- Authors: Wolfgang Otto, Matthäus Zloch, Lu Gan, Saurav Karmakar, and Stefan
Dietze
- Abstract summary: In academic writing, references to machine learning models and datasets are fundamental components.
Existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types.
We release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets.
- Score: 3.9169112083667073
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Named Entity Recognition (NER) models play a crucial role in various NLP
tasks, including information extraction (IE) and text understanding. In
academic writing, references to machine learning models and datasets are
fundamental components of various computer science publications and necessitate
accurate models for identification. Despite the advancements in NER, existing
ground truth datasets do not treat fine-grained types like ML model and model
architecture as separate entity types, and consequently, baseline models cannot
recognize them as such. In this paper, we release a corpus of 100 manually
annotated full-text scientific publications and a first baseline model for 10
entity types centered around ML models and datasets. In order to provide a
nuanced understanding of how ML models and datasets are mentioned and utilized,
our dataset also contains annotations for informal mentions like "our
BERT-based model" or "an image CNN". You can find the ground truth dataset and
code to replicate model training at https://data.gesis.org/gsap/gsap-ner.
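For readers who want a concrete picture of the task, a minimal token-classification baseline for entity types of this kind might look like the sketch below. The label subset, encoder checkpoint, and example sentence are illustrative assumptions, not the released GSAP-NER configuration; the actual ground truth data and training code are at the URL above.

```python
# Minimal sketch of a token-classification baseline for GSAP-NER-style entities.
# Assumptions: the BIO label subset and the encoder checkpoint below are illustrative,
# not the authors' released configuration (see https://data.gesis.org/gsap/gsap-ner).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical subset of the 10 entity types, in BIO scheme.
labels = ["O", "B-MLModel", "I-MLModel", "B-Dataset", "I-Dataset",
          "B-ModelArchitecture", "I-ModelArchitecture"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

sentence = "We fine-tune our BERT-based model on the SQuAD dataset."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(f"{token:15s} {labels[pred]}")         # untrained head: predictions are random
```

Fine-tuning the classification head on the annotated full texts would follow the usual token-classification recipe; the sketch only shows the inference shape of such a baseline.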
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
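As a rough, hypothetical illustration of LLM-based label verification (not the paper's actual pipeline), a filtering pass over noisy (image, label) candidates could take the shape below; `query_multimodal_llm`, the prompt, and the filtering rule are placeholders for whatever model or API is used.

```python
# Rough shape of an LLM-driven label-verification pass for noisy (image, label) pairs.
# `query_multimodal_llm` is a stand-in for the actual multimodal model/API; the prompt
# and acceptance rule are illustrative, not the paper's pipeline.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    image_path: str
    entity_label: str   # e.g. a noisy web-derived entity name

def query_multimodal_llm(image_path: str, prompt: str) -> str:
    """Placeholder: send the image and prompt to a multimodal LLM, return its answer."""
    raise NotImplementedError

def verify(candidate: Candidate) -> Optional[dict]:
    prompt = (
        f"Does this image depict the entity '{candidate.entity_label}'? "
        "Answer YES or NO, then give a one-sentence rationale."
    )
    answer = query_multimodal_llm(candidate.image_path, prompt)
    if answer.strip().upper().startswith("YES"):
        return {"image": candidate.image_path,
                "label": candidate.entity_label,
                "rationale": answer}      # keep the rationale as extra supervision
    return None                           # discard candidates the LLM rejects

# curated = [v for c in web_candidates if (v := verify(c)) is not None]
```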
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Machine Unlearning using a Multi-GAN based Model [0.0]
This article presents a new machine unlearning approach that utilizes multiple Generative Adversarial Network (GAN) based models.
The proposed method comprises two phases: i) data reorganization, in which synthetic data generated by the GAN model is introduced with inverted class labels for the forget dataset, and ii) fine-tuning of the pre-trained model.
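A compressed, illustrative sketch of that two-phase idea follows; the latent dimension, the label-inversion rule, and the interleaving of the two phases are assumptions for brevity, not the paper's exact models.

```python
# Illustrative sketch: (i) a GAN generator synthesizes samples resembling the "forget"
# classes but assigns them inverted/other class labels; (ii) the pre-trained classifier
# is fine-tuned on retained data mixed with these relabeled synthetic samples.
import torch
from torch import nn
from torch.utils.data import DataLoader

def invert_labels(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # Toy inversion rule: map each forget-class label to a different class.
    return (labels + 1) % num_classes

def unlearn(classifier: nn.Module, generator: nn.Module, forget_labels: torch.Tensor,
            retain_loader: DataLoader, num_classes: int, steps: int = 100):
    opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    retain_iter = iter(retain_loader)
    for _ in range(steps):
        # Phase (i): synthetic "forget-like" samples with inverted labels
        # (latent dim 64 and matching feature shapes are assumed).
        z = torch.randn(forget_labels.size(0), 64)
        fake_x, fake_y = generator(z), invert_labels(forget_labels, num_classes)
        # Phase (ii): fine-tune on retained data mixed with the relabeled synthetic data.
        try:
            real_x, real_y = next(retain_iter)
        except StopIteration:
            retain_iter = iter(retain_loader)
            real_x, real_y = next(retain_iter)
        x, y = torch.cat([real_x, fake_x]), torch.cat([real_y, fake_y])
        opt.zero_grad()
        loss_fn(classifier(x), y).backward()
        opt.step()
    return classifier
```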
arXiv Detail & Related papers (2024-07-26T02:28:32Z)
- Self-Regulated Data-Free Knowledge Amalgamation for Text Classification [9.169836450935724]
We develop a lightweight student network that can learn from multiple teacher models without accessing their original training data.
To accomplish this, we propose STRATANET, a modeling framework that produces text data tailored to each teacher.
We evaluate our method on three benchmark text classification datasets with varying labels or domains.
arXiv Detail & Related papers (2024-06-16T21:13:30Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence demonstrating that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
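An illustrative, single-direction version of such gradient projection is sketched below; the actual PGU algorithm builds its projection from the gradient space of the remaining dataset, so this is only the general shape of the idea.

```python
# Illustrative gradient-projection step for unlearning (not the exact PGU algorithm):
# the update computed on the "forget" data is made orthogonal to the gradient direction
# of the retained data, so it interferes less with retained knowledge.
import torch

def project_out(update: torch.Tensor, retain_grad: torch.Tensor) -> torch.Tensor:
    """Remove from `update` its component along `retain_grad`."""
    retain_dir = retain_grad / (retain_grad.norm() + 1e-12)
    return update - (update @ retain_dir) * retain_dir

# Toy flattened-parameter example.
forget_grad = torch.randn(10)   # grad of the loss on the forget data
retain_grad = torch.randn(10)   # grad of the loss on the remaining data
# Ascend the forget loss (to erase it) while removing the component that would,
# to first order, change the loss on the remaining data.
step = project_out(forget_grad, retain_grad)
print(float(step @ retain_grad))   # ~0: negligible first-order effect on the retain loss
```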
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- ProtoNER: Few shot Incremental Learning for Named Entity Recognition using Prototypical Networks [7.317342506617286]
A Prototypical Network based end-to-end KVP extraction model is presented.
It has no dependency on the dataset used for the initial training of the model.
It requires no intermediate synthetic data generation, which tends to add noise and degrade model performance.
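A generic prototypical-network classification step, of the kind such models build on, looks like the sketch below (an illustrative toy example, not ProtoNER itself): each class prototype is the mean embedding of its support examples, and a query is assigned to the class with the nearest prototype.

```python
# Minimal prototypical-network step: prototypes are mean support embeddings,
# queries are classified by nearest prototype. Illustrative sketch only.
import torch

def prototypes(support_emb: torch.Tensor, support_y: torch.Tensor, num_classes: int):
    return torch.stack([support_emb[support_y == c].mean(dim=0)
                        for c in range(num_classes)])

def classify(query_emb: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    dists = torch.cdist(query_emb, protos)   # (num_queries, num_classes)
    return dists.argmin(dim=1)

# Toy example: 3 classes, 5 support examples each, 16-dim token embeddings.
support_emb = torch.randn(15, 16)
support_y = torch.arange(3).repeat_interleave(5)
protos = prototypes(support_emb, support_y, num_classes=3)
print(classify(torch.randn(4, 16), protos))
```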
arXiv Detail & Related papers (2023-10-03T18:52:19Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
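The prefix-LM attention pattern behind this unification can be sketched as follows (an illustrative mask construction, not CodeGen2's actual implementation): prefix tokens attend bidirectionally, encoder-style, while the remaining tokens attend causally, decoder-style.

```python
# Illustrative prefix-LM attention mask: bidirectional over the prefix, causal elsewhere.
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True = this position may be attended to."""
    mask = torch.ones(seq_len, seq_len).tril().bool()  # causal base
    mask[:, :prefix_len] = True                         # full visibility of the prefix
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
```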
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, you are given access to a set of expert models and their predictions, alongside some limited information about the datasets used to train them.
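One hypothetical way to realize such instance-wise combination, weighting each expert by how close the test point lies to a summary of that expert's training data, is sketched below; the distance-based softmax weighting is an illustrative assumption, not the paper's algorithm.

```python
# Illustrative instance-wise unsupervised ensembling: each expert's prediction is
# weighted by the proximity of the test point to a summary (here the mean) of that
# expert's training distribution.
import numpy as np

def combine(x: np.ndarray, expert_preds: np.ndarray, expert_means: np.ndarray,
            temperature: float = 1.0) -> float:
    """expert_preds: (n_experts,) predictions for x; expert_means: (n_experts, d)."""
    dists = np.linalg.norm(expert_means - x, axis=1)
    weights = np.exp(-dists / temperature)
    weights /= weights.sum()
    return float(weights @ expert_preds)

# Toy example: 3 experts trained on different regions of a 2-D input space.
expert_means = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
expert_preds = np.array([1.0, 2.0, 3.0])        # each expert's prediction for x
x = np.array([4.5, 5.5])                        # closest to expert 1
print(combine(x, expert_preds, expert_means))   # ~2.0, dominated by expert 1
```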
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
- Fuzzy Simplicial Networks: A Topology-Inspired Model to Improve Task Generalization in Few-shot Learning [1.0062040918634414]
Few-shot learning algorithms are designed to generalize well to new tasks with limited data.
We introduce a new few-shot model called Fuzzy Simplicial Networks (FSN) which leverages a construction from topology to more flexibly represent each class from limited data.
arXiv Detail & Related papers (2020-09-23T17:01:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.