Evaluating Psychological Safety of Large Language Models
- URL: http://arxiv.org/abs/2212.10529v3
- Date: Thu, 29 Feb 2024 13:14:37 GMT
- Title: Evaluating Psychological Safety of Large Language Models
- Authors: Xingxuan Li, Yutong Li, Lin Qiu, Shafiq Joty, Lidong Bing
- Abstract summary: We designed unbiased prompts to evaluate the psychological safety of large language models (LLMs)
We tested five different LLMs by using two personality tests: Short Dark Triad (SD-3) and Big Five Inventory (BFI)
Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns.
Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model.
- Score: 72.88260608425949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we designed unbiased prompts to systematically evaluate the
psychological safety of large language models (LLMs). First, we tested five
different LLMs by using two personality tests: Short Dark Triad (SD-3) and Big
Five Inventory (BFI). All models scored higher than the human average on SD-3,
suggesting a relatively darker personality pattern. Despite being instruction
fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and
GPT-4 still showed dark personality patterns; these models scored higher than
self-supervised GPT-3 on the Machiavellianism and narcissism traits on SD-3.
Then, we evaluated the LLMs in the GPT series by using well-being tests to
study the impact of fine-tuning with more training data. We observed a
continuous increase in the well-being scores of GPT models. Following these
observations, we showed that fine-tuning Llama-2-chat-7B with responses from
BFI using direct preference optimization could effectively reduce the
psychological toxicity of the model. Based on the findings, we recommended the
application of systematic and comprehensive psychological metrics to further
evaluate and improve the safety of LLMs.
Related papers
- Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of Large Language Models' implicit bias towards certain groups by attacking them with carefully crafted instructions to elicit biased responses.
We propose three attack approaches, i.e., Disguise, Deception, and Teaching, based on which we built evaluation datasets for four common bias types.
arXiv Detail & Related papers (2024-06-20T06:42:08Z) - Large Language Models Show Human-like Social Desirability Biases in Survey Responses [12.767606361552684]
We show that Large Language Models (LLMs) skew their scores towards the desirable ends of trait dimensions when personality evaluation is inferred.
This bias exists in all tested models, including GPT-4/3.5, Claude 3, Llama 3, and PaLM-2.
reverse-coding all the questions decreases bias levels but does not eliminate them, suggesting that this effect cannot be attributed to acquiescence bias.
arXiv Detail & Related papers (2024-05-09T19:02:53Z) - Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z) - Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z) - Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation [12.777659013330823]
GPT-3 is used to estimate the Big 5 personality traits from users' social media posts.
We find that GPT-3 performance is somewhat close to an existing pre-trained SotA for broad classification.
We analyze where GPT-3 performs better, as well as worse, than a pretrained lexical model, illustrating systematic errors.
arXiv Detail & Related papers (2023-06-01T22:43:37Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z) - Evaluating the Susceptibility of Pre-Trained Language Models via
Handcrafted Adversarial Examples [0.0]
We highlight a major security vulnerability in the public release of GPT-3 and investigate this vulnerability in other state-of-the-art PLMs.
We underscore token distance-minimized perturbations as an effective adversarial approach, bypassing both supervised and unsupervised quality measures.
arXiv Detail & Related papers (2022-09-05T20:29:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.