Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation
- URL: http://arxiv.org/abs/2306.01183v1
- Date: Thu, 1 Jun 2023 22:43:37 GMT
- Title: Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation
- Authors: Adithya V Ganesan, Yash Kumar Lal, August Håkan Nilsson, H. Andrew Schwartz
- Abstract summary: GPT-3 is used to estimate the Big 5 personality traits from users' social media posts.
We find that GPT-3 performance is somewhat close to an existing pre-trained SotA for broad classification.
We analyze where GPT-3 performs better, as well as worse, than a pretrained lexical model, illustrating systematic errors.
- Score: 12.777659013330823
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Very large language models (LLMs) perform extremely well on a spectrum of NLP
tasks in a zero-shot setting. However, little is known about their performance
on human-level NLP problems which rely on understanding psychological concepts,
such as assessing personality traits. In this work, we investigate the
zero-shot ability of GPT-3 to estimate the Big 5 personality traits from users'
social media posts. Through a set of systematic experiments, we find that
zero-shot GPT-3 performance is somewhat close to an existing pre-trained SotA
for broad classification upon injecting knowledge about the trait in the
prompts. However, when prompted to provide fine-grained classification, its
performance drops to close to a simple most frequent class (MFC) baseline. We
further analyze where GPT-3 performs better, as well as worse, than a
pretrained lexical model, illustrating systematic errors that suggest ways to
improve LLMs on human-level NLP tasks.
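The abstract's setup can be illustrated with a minimal sketch. The prompt wording, trait descriptions, and function names below are hypothetical (the paper's exact prompts are not reproduced here); the sketch only shows the two ideas the abstract names: injecting trait knowledge into a zero-shot prompt, and the most frequent class (MFC) baseline used for comparison.

```python
# Hypothetical sketch: knowledge-injected zero-shot prompting for one
# Big 5 trait, plus the most frequent class (MFC) baseline.
from collections import Counter

# Illustrative one-line trait descriptions (assumed, not from the paper).
TRAIT_KNOWLEDGE = {
    "openness": "Openness reflects curiosity, imagination, and a preference for novelty.",
    "extraversion": "Extraversion reflects sociability, talkativeness, and assertiveness.",
}

def build_prompt(trait: str, posts: list[str]) -> str:
    """Compose a zero-shot classification prompt for a single trait,
    prepending the trait description (the 'knowledge injection' step)."""
    joined = "\n".join(f"- {p}" for p in posts)
    return (
        f"{TRAIT_KNOWLEDGE[trait]}\n"
        f"Given the following social media posts, classify the author's "
        f"{trait} as high or low.\n"
        f"Posts:\n{joined}\n"
        f"Answer (high/low):"
    )

def mfc_baseline(train_labels: list[str]) -> str:
    """Always predict the class that is most frequent in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

prompt = build_prompt("openness", ["Tried a new recipe today!",
                                   "Reading about astronomy before bed."])
print(mfc_baseline(["high", "high", "low"]))  # -> high
```

The MFC baseline matters because, per the abstract, fine-grained zero-shot GPT-3 classification falls back to roughly this level of performance.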
Related papers
- Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance [11.595274304409937]
Large language models (LLMs) have revolutionized zero-shot task performance.
Current methods using trigger phrases such as "Let's think step by step" remain limited.
This study introduces PRomPTed, an approach that optimizes the zero-shot prompts for individual task instances.
arXiv Detail & Related papers (2023-10-03T14:51:34Z) - Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective [63.92197404447808]
Large language models (LLMs) have shown some human-like cognitive abilities.
We propose an adaptive testing framework for LLM evaluation.
This approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z) - Evaluating Psychological Safety of Large Language Models [72.88260608425949]
We designed unbiased prompts to evaluate the psychological safety of large language models (LLMs).
We tested five different LLMs using two personality tests: the Short Dark Triad (SD-3) and the Big Five Inventory (BFI).
Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns.
Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model.
arXiv Detail & Related papers (2022-12-20T18:45:07Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z) - Using cognitive psychology to understand GPT-3 [0.0]
We study GPT-3, a recent large language model, using tools from cognitive psychology.
We assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities.
arXiv Detail & Related papers (2022-06-21T20:06:03Z) - Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again [24.150464908060112]
We present the first systematic and comprehensive study to compare the few-shot performance of GPT-3 in-context learning with fine-tuning smaller (i.e., BERT-sized) PLMs.
Our results show that GPT-3 still significantly underperforms compared with simply fine-tuning a smaller PLM using the same small training set.
arXiv Detail & Related papers (2022-03-16T05:56:08Z) - Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.