A negation detection assessment of GPTs: analysis with the xNot360
dataset
- URL: http://arxiv.org/abs/2306.16638v1
- Date: Thu, 29 Jun 2023 02:27:48 GMT
- Title: A negation detection assessment of GPTs: analysis with the xNot360
dataset
- Authors: Ha Thanh Nguyen, Randy Goebel, Francesca Toni, Kostas Stathis, Ken
Satoh
- Abstract summary: Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension.
We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset.
Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction.
- Score: 9.165119034384027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Negation is a fundamental aspect of natural language, playing a critical role
in communication and comprehension. Our study assesses the negation detection
performance of Generative Pre-trained Transformer (GPT) models, specifically
GPT-2, GPT-3, GPT-3.5, and GPT-4. We focus on the identification of negation in
natural language using a zero-shot prediction approach applied to our custom
xNot360 dataset. Our approach examines sentence pairs labeled to indicate
whether the second sentence negates the first. Our findings expose a
considerable performance disparity among the GPT models, with GPT-4 surpassing
its counterparts and GPT-3.5 displaying a marked performance reduction. The
overall proficiency of the GPT models in negation detection remains relatively
modest, indicating that this task pushes the boundaries of their natural
language understanding capabilities. We not only highlight the constraints of
GPT models in handling negation but also emphasize the importance of logical
reliability in high-stakes domains such as healthcare, science, and law.
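The zero-shot setup described above can be made concrete with a small sketch. The following is a minimal, illustrative reconstruction of a zero-shot negation-detection loop, assuming the legacy OpenAI Python client (openai<1.0, with OPENAI_API_KEY set in the environment) and a hypothetical JSONL serialization of xNot360 with `sentence_1`, `sentence_2`, and binary `label` fields; the paper's exact prompt wording and data format are not reproduced here, so those names and the prompt are assumptions.

```python
# Minimal sketch of zero-shot negation detection over labeled sentence pairs.
# Assumptions: legacy OpenAI chat API (openai<1.0), a hypothetical
# "xnot360.jsonl" file, and an illustrative prompt; none of these are
# confirmed by the paper itself.
import json
import openai  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Sentence 1: {s1}\n"
    "Sentence 2: {s2}\n"
    "Does Sentence 2 negate Sentence 1? Answer Yes or No."
)

def predict_negation(s1: str, s2: str, model: str = "gpt-4") -> int:
    """Return 1 if the model answers that Sentence 2 negates Sentence 1."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(s1=s1, s2=s2)}],
        temperature=0,  # deterministic decoding for evaluation
    )
    answer = response["choices"][0]["message"]["content"].strip().lower()
    return 1 if answer.startswith("yes") else 0

def evaluate(path: str = "xnot360.jsonl", model: str = "gpt-4") -> float:
    """Accuracy over sentence pairs; label 1 is assumed to mean 'negation'."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            pair = json.loads(line)
            pred = predict_negation(pair["sentence_1"], pair["sentence_2"], model)
            correct += int(pred == pair["label"])
            total += 1
    return correct / total
```

Swapping the `model` argument across "gpt-3.5-turbo", "gpt-4", and so on would reproduce the kind of cross-model comparison the abstract reports, though the authors' actual prompt and decoding settings may differ.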
Related papers
- AGB-DE: A Corpus for the Automated Legal Assessment of Clauses in German Consumer Contracts [4.427516854041417]
We introduce AGB-DE, a corpus of 3,764 clauses from German consumer contracts that have been annotated and legally assessed by legal experts.
We compare the performance of an SVM baseline with three fine-tuned open language models and the performance of GPT-3.5.
An analysis of the errors indicates that one of the main challenges could be the correct interpretation of complex clauses.
arXiv Detail & Related papers (2024-06-10T21:27:13Z)
- An Empirical Analysis on Large Language Models in Debate Evaluation [10.677407097411768]
We investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation.
We uncover a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented.
We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry numerical or sequential connotations.
arXiv Detail & Related papers (2024-05-28T18:34:53Z)
- An Analysis of Language Frequency and Error Correction for Esperanto [0.0]
We conduct a comprehensive frequency analysis using the Eo-GP dataset.
We then introduce the Eo-GEC dataset, derived from authentic user cases.
Using GPT-3.5 and GPT-4, our experiments show that GPT-4 outperforms GPT-3.5 in both automated and human evaluations.
arXiv Detail & Related papers (2024-02-15T04:10:25Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks.
In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z)
- Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction [28.58384091374763]
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks.
We perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks.
We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting.
arXiv Detail & Related papers (2023-03-25T03:08:49Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study spurs the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and ability to handle the various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Improving negation detection with negation-focused pre-training [58.32362243122714]
Negation is a common linguistic feature that is crucial in many language understanding tasks.
Recent work has shown that state-of-the-art NLP models underperform on samples containing negation.
We propose a new negation-focused pre-training strategy, involving targeted data augmentation and negation masking.
arXiv Detail & Related papers (2022-05-09T02:41:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.