GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity
Extraction Focused on Machine Learning Models and Datasets
- URL: http://arxiv.org/abs/2311.09860v1
- Date: Thu, 16 Nov 2023 12:43:02 GMT
- Title: GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity
Extraction Focused on Machine Learning Models and Datasets
- Authors: Wolfgang Otto, Matthäus Zloch, Lu Gan, Saurav Karmakar, and Stefan
Dietze
- Abstract summary: In academic writing, references to machine learning models and datasets are fundamental components.
Existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types.
We release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Named Entity Recognition (NER) models play a crucial role in various NLP
tasks, including information extraction (IE) and text understanding. In
academic writing, references to machine learning models and datasets are
fundamental components of various computer science publications and necessitate
accurate models for identification. Despite the advancements in NER, existing
ground truth datasets do not treat fine-grained types like ML model and model
architecture as separate entity types, and consequently, baseline models cannot
recognize them as such. In this paper, we release a corpus of 100 manually
annotated full-text scientific publications and a first baseline model for 10
entity types centered around ML models and datasets. In order to provide a
nuanced understanding of how ML models and datasets are mentioned and utilized,
our dataset also contains annotations for informal mentions like "our
BERT-based model" or "an image CNN". You can find the ground truth dataset and
code to replicate model training at https://data.gesis.org/gsap/gsap-ner.
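To make the baseline setting concrete, below is a minimal fine-tuning sketch for a token-classification model over the GSAP-NER entity types. It assumes the released corpus has been converted to word-level BIO tags in JSON files; the label names, file paths, hyperparameters, and the SciBERT checkpoint are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal token-classification baseline sketch for GSAP-NER-style entity types.
# File names, label set, and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

# Hypothetical BIO tags for a few of the 10 entity types (e.g. MLModel,
# ModelArchitecture, Dataset); replace with the released annotation schema.
LABELS = ["O", "B-MLModel", "I-MLModel",
          "B-ModelArchitecture", "I-ModelArchitecture",
          "B-Dataset", "I-Dataset"]
label2id = {l: i for i, l in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=len(LABELS),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

def tokenize_and_align(example):
    # Align word-level BIO tags with sub-word tokens: label the first sub-token
    # of each word and mask the rest with -100 so they are ignored by the loss.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for w in enc.word_ids():
        if w is None or w == prev:
            aligned.append(-100)
        else:
            aligned.append(label2id[example["ner_tags"][w]])
        prev = w
    enc["labels"] = aligned
    return enc

# Hypothetical JSON files with "tokens" and "ner_tags" fields per sentence.
ds = load_dataset("json", data_files={"train": "gsap_train.json",
                                      "validation": "gsap_dev.json"})
ds = ds.map(tokenize_and_align, remove_columns=ds["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gsap-ner-baseline",
                           learning_rate=2e-5,
                           num_train_epochs=5,
                           per_device_train_batch_size=8,
                           evaluation_strategy="epoch"),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```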
Related papers
- Self-Regulated Data-Free Knowledge Amalgamation for Text Classification [9.169836450935724]
We develop a lightweight student network that can learn from multiple teacher models without accessing their original training data.
To accomplish this, we propose STRATANET, a modeling framework that produces text data tailored to each teacher.
We evaluate our method on three benchmark text classification datasets with varying labels or domains.
arXiv Detail & Related papers (2024-06-16T21:13:30Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method produces models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
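The gradient-projection idea can be illustrated with a short, generic sketch (a hypothetical illustration, not the paper's exact PGU procedure): the update computed on the forget data is projected onto the orthogonal complement of the retained-data gradient, so that, to first order, the step does not disturb performance on the remaining data.

```python
# Generic sketch of projected-gradient unlearning; illustration only.
import torch

def flat_grad(loss, params):
    # Flatten the gradient of `loss` w.r.t. `params` into one vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

@torch.no_grad()
def add_to_params(params, vec, scale):
    # Add `scale * vec` (a flat vector) back into the parameter tensors.
    offset = 0
    for p in params:
        n = p.numel()
        p.add_(vec[offset:offset + n].view_as(p), alpha=scale)
        offset += n

def projected_unlearning_step(model, loss_forget, loss_retain, lr=1e-3):
    # `loss_forget` / `loss_retain` come from separate forward passes over a
    # batch of data to forget and a batch of data to keep, respectively.
    params = [p for p in model.parameters() if p.requires_grad]
    g_forget = flat_grad(loss_forget, params)   # direction that erases the forget data
    g_retain = flat_grad(loss_retain, params)   # direction that matters for retained data
    # Remove the component of the unlearning update that lies along the
    # retained-data gradient, so the step is (to first order) neutral there.
    coeff = torch.dot(g_forget, g_retain) / (g_retain.norm() ** 2 + 1e-12)
    g_proj = g_forget - coeff * g_retain
    add_to_params(params, g_proj, scale=lr)     # gradient *ascent* on the forget loss
```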
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- FLIP: Towards Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for click-through rate (CTR) prediction.
Specifically, the masked data of one modality (i.e., tokens or features) has to be recovered with the help of the other modality, which establishes the feature-level interaction and alignment.
Experiments on three real-world datasets demonstrate that FLIP outperforms SOTA baselines and is highly compatible with various ID-based models and PLMs.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
- ProtoNER: Few shot Incremental Learning for Named Entity Recognition using Prototypical Networks [7.317342506617286]
A Prototypical Network-based, end-to-end KVP (key-value pair) extraction model is presented.
It has no dependency on the dataset used for the initial training of the model.
It requires no intermediate synthetic data generation, which tends to add noise and degrade the model's performance.
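As a hypothetical illustration of the prototypical-network idea that such few-shot extraction builds on (not the paper's exact architecture), each class can be represented by the mean embedding of its few annotated support examples, and a query is assigned to the nearest prototype:

```python
# Generic prototypical-network classification sketch (illustration only).
import torch

def class_prototypes(support_emb, support_labels, num_classes):
    # support_emb: (n, d) embeddings of support examples,
    # support_labels: (n,) integer class ids in [0, num_classes).
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(num_classes)])

def nearest_prototype(query_emb, prototypes):
    # Assign each query embedding to the class with the closest prototype
    # (Euclidean distance); returns a (q,) tensor of class ids.
    return torch.cdist(query_emb, prototypes).argmin(dim=1)

# A new entity type can be added by computing its prototype from a handful of
# annotated examples, without retraining on the original dataset.
```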
arXiv Detail & Related papers (2023-10-03T18:52:19Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
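A minimal sketch of parameter-space merging is shown below, assuming several fine-tuned checkpoints of the same architecture; plain averaging is used purely as an illustration and is not the paper's specific merging rule.

```python
# Illustrative parameter-space fusion of checkpoints that share one architecture.
import torch

def merge_state_dicts(state_dicts):
    """Average floating-point parameters across checkpoints; copy other buffers."""
    merged = {}
    for name, ref in state_dicts[0].items():
        if ref.is_floating_point():
            merged[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
        else:
            # Integer buffers (e.g. batch-norm counters) are copied, not averaged.
            merged[name] = ref.clone()
    return merged

# Usage (paths are hypothetical): fuse fine-tuned models without any data.
# checkpoints = [torch.load(p, map_location="cpu") for p in ["task_a.pt", "task_b.pt"]]
# model.load_state_dict(merge_state_dicts(checkpoints))
```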
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Generative Entity Typing with Curriculum Learning [18.43562065432877]
We propose a novel generative entity typing (GET) paradigm.
Given a text with an entity mention, multiple types for the role the entity plays in the text are generated with a pre-trained language model.
Our experiments justify the superiority of our GET model over the state-of-the-art entity typing models.
arXiv Detail & Related papers (2022-10-06T13:32:50Z)
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage the GAP model obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
- Fuzzy Simplicial Networks: A Topology-Inspired Model to Improve Task Generalization in Few-shot Learning [1.0062040918634414]
Few-shot learning algorithms are designed to generalize well to new tasks with limited data.
We introduce a new few-shot model called Fuzzy Simplicial Networks (FSN) which leverages a construction from topology to more flexibly represent each class from limited data.
arXiv Detail & Related papers (2020-09-23T17:01:09Z)