Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation
- URL: http://arxiv.org/abs/2306.01183v1
- Date: Thu, 1 Jun 2023 22:43:37 GMT
- Title: Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation
- Authors: Adithya V Ganesan, Yash Kumar Lal, August Håkan Nilsson, H. Andrew Schwartz
- Abstract summary: GPT-3 is used to estimate the Big 5 personality traits from users' social media posts.
We find that GPT-3 performance is somewhat close to an existing pre-trained SotA for broad classification.
We analyze where GPT-3 performs better, as well as worse, than a pretrained lexical model, illustrating systematic errors.
- Score: 12.777659013330823
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Very large language models (LLMs) perform extremely well on a spectrum of NLP
tasks in a zero-shot setting. However, little is known about their performance
on human-level NLP problems which rely on understanding psychological concepts,
such as assessing personality traits. In this work, we investigate the
zero-shot ability of GPT-3 to estimate the Big 5 personality traits from users'
social media posts. Through a set of systematic experiments, we find that
zero-shot GPT-3 performance is somewhat close to an existing pre-trained SotA
for broad classification upon injecting knowledge about the trait in the
prompts. However, when prompted to provide fine-grained classification, its
performance drops to close to a simple most frequent class (MFC) baseline. We
further analyze where GPT-3 performs better, as well as worse, than a
pretrained lexical model, illustrating systematic errors that suggest ways to
improve LLMs on human-level NLP tasks.
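The abstract's setup can be illustrated with a minimal sketch. The prompt wording, trait descriptions, and function names below are hypothetical (the paper's exact prompts are not reproduced here); the sketch only shows the two ideas the abstract names: injecting trait knowledge into a zero-shot prompt, and the most frequent class (MFC) baseline used for comparison.

```python
# Hypothetical sketch: knowledge-injected zero-shot prompting for one
# Big 5 trait, plus the most frequent class (MFC) baseline.
from collections import Counter

# Illustrative one-line trait descriptions (assumed, not from the paper).
TRAIT_KNOWLEDGE = {
    "openness": "Openness reflects curiosity, imagination, and a preference for novelty.",
    "extraversion": "Extraversion reflects sociability, talkativeness, and assertiveness.",
}

def build_prompt(trait: str, posts: list[str]) -> str:
    """Compose a zero-shot classification prompt for a single trait,
    prepending the trait description (the 'knowledge injection' step)."""
    joined = "\n".join(f"- {p}" for p in posts)
    return (
        f"{TRAIT_KNOWLEDGE[trait]}\n"
        f"Given the following social media posts, classify the author's "
        f"{trait} as high or low.\n"
        f"Posts:\n{joined}\n"
        f"Answer (high/low):"
    )

def mfc_baseline(train_labels: list[str]) -> str:
    """Always predict the class that is most frequent in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

prompt = build_prompt("openness", ["Tried a new recipe today!",
                                   "Reading about astronomy before bed."])
print(mfc_baseline(["high", "high", "low"]))  # -> high
```

The MFC baseline matters because, per the abstract, fine-grained zero-shot GPT-3 classification falls back to roughly this level of performance.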
Related papers
- Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance [11.595274304409937]
Large language models (LLMs) have revolutionized zero-shot task performance.
Current methods using trigger phrases such as "Let's think step by step" remain limited.
This study introduces PRomPTed, an approach that optimizes the zero-shot prompts for individual task instances.
arXiv Detail & Related papers (2023-10-03T14:51:34Z) - Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective [63.92197404447808]
Large language models (LLMs) have shown some human-like cognitive abilities.
We propose an adaptive testing framework for LLM evaluation.
This approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z) - Evaluating Psychological Safety of Large Language Models [72.88260608425949]
We designed unbiased prompts to evaluate the psychological safety of large language models (LLMs).
We tested five different LLMs using two personality tests: the Short Dark Triad (SD-3) and the Big Five Inventory (BFI).
Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns.
Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model.
arXiv Detail & Related papers (2022-12-20T18:45:07Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z) - Using cognitive psychology to understand GPT-3 [0.0]
We study GPT-3, a recent large language model, using tools from cognitive psychology.
We assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities.
arXiv Detail & Related papers (2022-06-21T20:06:03Z) - Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again [24.150464908060112]
We present the first systematic and comprehensive study to compare the few-shot performance of GPT-3 in-context learning with fine-tuning smaller (i.e., BERT-sized) PLMs.
Our results show that GPT-3 still significantly underperforms compared with simply fine-tuning a smaller PLM using the same small training set.
arXiv Detail & Related papers (2022-03-16T05:56:08Z) - Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.