Larger Probes Tell a Different Story: Extending Psycholinguistic
Datasets Via In-Context Learning
- URL: http://arxiv.org/abs/2303.16445v3
- Date: Tue, 14 Nov 2023 17:24:28 GMT
- Title: Larger Probes Tell a Different Story: Extending Psycholinguistic
Datasets Via In-Context Learning
- Authors: Namrata Shivagunde, Vladislav Lialin, and Anna Rumshisky
- Abstract summary: We introduce new, larger datasets for negation and role reversal inspired by psycholinguistic studies.
We dramatically extend existing NEG-136 and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44 sentence pairs to 750 each.
We evaluate 22 models on the extended datasets, seeing model performance dip 20-57% compared to the original smaller benchmarks.
- Score: 14.606961537327345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language model probing is often used to test specific capabilities of models.
However, conclusions from such studies may be limited when the probing
benchmarks are small and lack statistical power. In this work, we introduce
new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500)
inspired by psycholinguistic studies. We dramatically extend existing NEG-136
and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44
sentence pairs to 750 each. We also create another version of the extended
negation dataset (NEG-1500-SIMP-TEMP) using template-based generation; it
consists of 770 sentence pairs. We evaluate 22 models on the extended datasets,
seeing model performance dip 20-57% compared to the original smaller
benchmarks. We observe high levels of negation sensitivity in models like BERT
and ALBERT, demonstrating that previous findings might have been skewed due to
smaller test sets. Finally, we observe that although GPT3 generated all the
examples in ROLE-1500, it is only able to solve 24.6% of them during probing.
The datasets and code are available at
https://github.com/text-machine-lab/extending_psycholinguistic_dataset.
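To make the setup concrete, here is a minimal sketch of what a template-based negation pair plus a fill-mask probe might look like. The templates, word list, and change-of-prediction check below are illustrative assumptions, not the authors' released NEG-1500-SIMP-TEMP generation or evaluation code; see the repository above for the actual datasets and scripts.

```python
# Illustrative sketch only: templates, items, and the change-of-prediction
# metric are assumptions for demonstration, not the paper's released code.
from transformers import pipeline

# Hypothetical (exemplar, category) items in the spirit of simple
# category-membership templates.
ITEMS = [
    ("robin", "bird"),
    ("hammer", "tool"),
    ("maple", "tree"),
]


def make_pair(exemplar: str):
    """Build an affirmative/negated cloze pair from one simple template."""
    return (f"A {exemplar} is a [MASK].",
            f"A {exemplar} is not a [MASK].")


def negation_sensitivity(model_name: str = "bert-base-uncased") -> float:
    """Fraction of items whose top-1 completion changes under negation."""
    fill = pipeline("fill-mask", model=model_name)
    changed = 0
    for exemplar, _category in ITEMS:
        affirmative, negated = make_pair(exemplar)
        top_affirmative = fill(affirmative, top_k=1)[0]["token_str"]
        top_negated = fill(negated, top_k=1)[0]["token_str"]
        changed += int(top_affirmative != top_negated)
    return changed / len(ITEMS)


if __name__ == "__main__":
    print(f"Negation sensitivity: {negation_sensitivity():.2f}")
```

This only illustrates the general cloze-probing idea; the exact templates, target completions, and scoring used for the released datasets are defined in the repository linked above.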
Related papers
- Zephyr: Direct Distillation of LM Alignment [59.03530095974505]
We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy.
We apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.
arXiv Detail & Related papers (2023-10-25T19:25:16Z)
- Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z)
- Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023? [10.789928720739734]
We evaluate the generalization of over 20 different models trained on CoNLL-2003.
Surprisingly, we find no evidence of performance degradation in pre-trained Transformers, such as RoBERTa and T5.
Our analysis suggests that most deterioration is due to temporal mismatch between the pre-training corpora and the downstream test sets.
arXiv Detail & Related papers (2022-12-19T18:59:56Z)
- How to train your draGAN: A task oriented solution to imbalanced classification [15.893327571516016]
This paper proposes a unique, performance-oriented, data-generating strategy that utilizes a new architecture, coined draGAN.
The samples are generated with the objective of optimizing the classification model's performance, rather than similarity to the real data.
Empirically we show the superiority of draGAN, but also highlight some of its shortcomings.
arXiv Detail & Related papers (2022-11-18T07:37:34Z)
- Deconstructing Distributions: A Pointwise Framework of Learning [15.517383696434162]
We study a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point.
We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution.
arXiv Detail & Related papers (2022-02-20T23:25:28Z)
- Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- The effects of regularisation on RNN models for time series forecasting: Covid-19 as an example [2.5397218862229254]
This paper presents a model with greater flexibility than the other proposed neural networks.
To improve performance on small data, six regularisation methods were tested.
Applying Dropout to a GRU model trained on only 28 days of data reduced the RMSE by 23%.
arXiv Detail & Related papers (2021-05-09T10:50:57Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the question: Have we reached a performance ceiling, or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
- Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)