Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?
- URL: http://arxiv.org/abs/2212.09747v2
- Date: Wed, 12 Jul 2023 02:41:46 GMT
- Title: Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?
- Authors: Shuheng Liu, Alan Ritter
- Abstract summary: We evaluate the generalization of over 20 different models trained on CoNLL-2003.
Surprisingly, we find no evidence of performance degradation in pre-trained Transformers, such as RoBERTa and T5.
Our analysis suggests that most deterioration is due to temporal mismatch between the pre-training corpora and the downstream test sets.
- Score: 10.789928720739734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The CoNLL-2003 English named entity recognition (NER) dataset has been widely
used to train and evaluate NER models for almost 20 years. However, it is
unclear how well models that are trained on this 20-year-old data and developed
over a period of decades using the same test set will perform when applied on
modern data. In this paper, we evaluate the generalization of over 20 different
models trained on CoNLL-2003, and show that NER models have very different
generalization. Surprisingly, we find no evidence of performance degradation in
pre-trained Transformers, such as RoBERTa and T5, even when fine-tuned using
decades-old data. We investigate why some models generalize well to new data
while others do not, and attempt to disentangle the effects of temporal drift
and overfitting due to test reuse. Our analysis suggests that most
deterioration is due to temporal mismatch between the pre-training corpora and
the downstream test sets. We found that four factors are important for good
generalization: model architecture, number of parameters, time period of the
pre-training corpus, and the amount of fine-tuning data. We suggest
current evaluation methods have, in some sense, underestimated progress on NER
over the past 20 years, as NER models have not only improved on the original
CoNLL-2003 test set, but improved even more on modern data. Our datasets can be
found at https://github.com/ShuhengL/acl2023_conllpp.
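The comparison described in the abstract, scoring the same tagger on the original test set and on a newer one, can be illustrated with a minimal sketch. It assumes BIO-tagged sequences, exact-match entity-level F1, and relative change in F1 as the measure of degradation; the function names and the toy tag sequences are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (lenient about stray I- tags)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching (span, type) pairs."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def percent_change(f1_old: float, f1_new: float) -> float:
    """Relative change in F1 from the old test set to the new one, in percent."""
    return 100.0 * (f1_new - f1_old) / f1_old


# Toy, made-up data: one sentence per "test set".
gold_old = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_old = [["B-PER", "I-PER", "O", "O"]]      # misses the LOC entity
gold_new = [["B-ORG", "O", "B-PER"]]
pred_new = [["B-ORG", "O", "B-PER"]]           # all entities correct

f1_old = entity_f1(gold_old, pred_old)
f1_new = entity_f1(gold_new, pred_new)
change = percent_change(f1_old, f1_new)
print(f"F1 old={f1_old:.3f} new={f1_new:.3f} change={change:+.1f}%")
```

A negative change would indicate degradation on the newer data; a value near zero or above would match the "no evidence of performance degradation" finding reported for pre-trained Transformers.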
Related papers
- GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation [90.53485251837235]
Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training.
GIFT-Eval is a pioneering benchmark aimed at promoting evaluation across diverse datasets.
GIFT-Eval encompasses 23 datasets over 144,000 time series and 177 million data points.
arXiv Detail & Related papers (2024-10-14T11:29:38Z)
- Polyp Segmentation Generalisability of Pretrained Backbones [12.991813293135195]
We consider how well models with different pretrained backbones generalise to data of a somewhat different distribution to the training data.
Our results imply that models with ResNet50 backbones typically generalise better, despite being outperformed by models with ViT-B backbones.
arXiv Detail & Related papers (2024-05-24T13:09:52Z)
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
Multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance.
Our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
arXiv Detail & Related papers (2024-04-04T17:58:02Z)
- Machine Learning Models in Stock Market Prediction [0.0]
The paper focuses on predicting the Nifty 50 Index by using 8 Supervised Machine Learning Models.
Experiments are based on historical data of Nifty 50 Index of Indian Stock Market from 22nd April, 1996 to 16th April, 2021.
arXiv Detail & Related papers (2022-02-06T10:33:42Z)
- Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
- Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension [27.538957000237176]
Humans create questions adversarially, such that the model fails to answer them correctly.
We collect 36,000 samples with progressively stronger models in the annotation loop.
We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets.
We find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.
arXiv Detail & Related papers (2020-02-02T00:22:55Z)