Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation
- URL: http://arxiv.org/abs/2306.01183v1
- Date: Thu, 1 Jun 2023 22:43:37 GMT
- Title: Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation
- Authors: Adithya V Ganesan, Yash Kumar Lal, August Håkan Nilsson, H. Andrew Schwartz
- Abstract summary: GPT-3 is used to estimate the Big 5 personality traits from users' social media posts.
We find that GPT-3 performance is somewhat close to an existing pre-trained SotA for broad classification.
We analyze where GPT-3 performs better, as well as worse, than a pretrained lexical model, illustrating systematic errors.
- Score: 12.777659013330823
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Very large language models (LLMs) perform extremely well on a spectrum of NLP
tasks in a zero-shot setting. However, little is known about their performance
on human-level NLP problems which rely on understanding psychological concepts,
such as assessing personality traits. In this work, we investigate the
zero-shot ability of GPT-3 to estimate the Big 5 personality traits from users'
social media posts. Through a set of systematic experiments, we find that
zero-shot GPT-3 performance is somewhat close to an existing pre-trained SotA
for broad classification upon injecting knowledge about the trait in the
prompts. However, when prompted to provide fine-grained classification, its
performance drops to near that of a simple most frequent class (MFC) baseline. We
further analyze where GPT-3 performs better, as well as worse, than a
pretrained lexical model, illustrating systematic errors that suggest ways to
improve LLMs on human-level NLP tasks.
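The abstract describes a zero-shot prompting setup (with knowledge about the trait injected into the prompt) scored against a most frequent class (MFC) baseline. A minimal sketch of that framing is below; the prompt wording, the low/high label set, and the `call_llm` helper are illustrative assumptions, not the authors' exact protocol.

```python
from collections import Counter
from typing import Callable

# Big 5 traits named in the abstract; the coarse "low"/"high" labels and
# the prompt text are assumptions for illustration, not the paper's setup.
TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]
LABELS = ["low", "high"]

def build_prompt(posts: list[str], trait: str) -> str:
    """Zero-shot prompt that injects a short description of the trait,
    which the abstract reports brings GPT-3 closer to the pretrained SotA."""
    joined = "\n".join(f"- {p}" for p in posts)
    return (
        f"{trait.capitalize()} is one of the Big 5 personality traits.\n"
        f"Here are a user's social media posts:\n{joined}\n"
        f"Based on these posts, is the user's {trait} low or high? "
        f"Answer with one word."
    )

def mfc_baseline(train_labels: list[str]) -> str:
    """Most frequent class (MFC) baseline: always predict the majority
    label observed in the training split."""
    return Counter(train_labels).most_common(1)[0][0]

def classify(posts: list[str], trait: str,
             call_llm: Callable[[str], str], fallback: str) -> str:
    """call_llm is a hypothetical completion function (prompt -> text);
    any real GPT-3 client would be plugged in here."""
    answer = call_llm(build_prompt(posts, trait)).strip().lower()
    return answer if answer in LABELS else fallback
```

A fine-grained variant would expand LABELS to more bins (e.g., five levels per trait), which is the regime where the abstract reports GPT-3 dropping to near-MFC performance.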
Related papers
- A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
Prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin (a sketch of this score-vs-human correlation check follows the list below).
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness from Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z)
- Evaluating Psychological Safety of Large Language Models [72.88260608425949]
We designed unbiased prompts to evaluate the psychological safety of large language models (LLMs).
We tested five different LLMs using two personality tests: the Short Dark Triad (SD-3) and the Big Five Inventory (BFI).
Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns.
Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model.
arXiv Detail & Related papers (2022-12-20T18:45:07Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, and that these summaries do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Using cognitive psychology to understand GPT-3 [0.0]
We study GPT-3, a recent large language model, using tools from cognitive psychology.
We assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities.
arXiv Detail & Related papers (2022-06-21T20:06:03Z)
- Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again [24.150464908060112]
We present the first systematic and comprehensive study to compare the few-shot performance of GPT-3 in-context learning with fine-tuning smaller (i.e., BERT-sized) PLMs.
Our results show that GPT-3 still significantly underperforms fine-tuning a smaller PLM on the same small training set.
arXiv Detail & Related papers (2022-03-16T05:56:08Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
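As a point of reference for the G-Eval entry above, the sketch below shows how a Spearman correlation between evaluator scores and human ratings is computed. The scores here are made-up placeholders; in G-Eval the model scores would come from a GPT-4 chain-of-thought, form-filling evaluator and the human ratings from annotators.

```python
from scipy.stats import spearmanr

# Hypothetical per-summary quality scores (1-5 scale); not real data.
model_scores  = [4.5, 3.0, 2.0, 4.0, 1.5, 3.5]
human_ratings = [5.0, 3.5, 1.0, 4.5, 2.0, 3.0]

# Spearman's rho is rank-based, so it rewards getting the ordering of
# summaries right rather than matching the raw score values.
rho, p_value = spearmanr(model_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

Rank-based agreement is the usual choice here because LLM evaluators and humans rarely share a calibrated scale; only the relative ordering of outputs is comparable.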