Related papers: Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

URL: http://arxiv.org/abs/2308.03656v6
Date: Fri, 04 Oct 2024 20:02:14 GMT
Title: Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench
Authors: Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu
Abstract summary: We evaluate Large Language Models' (LLMs) anthropomorphic capabilities using the emotion appraisal theory from psychology. We collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. We conduct a human evaluation involving more than 1,200 subjects worldwide.
Score: 83.41621219298489
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating Large Language Models' (LLMs) anthropomorphic capabilities has become increasingly important in contemporary discourse. Utilizing the emotion appraisal theory from psychology, we propose to evaluate the empathy ability of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. With the human evaluation results as references, our evaluation includes seven LLMs, covering both commercial and open-source models, including variations in model sizes, featuring the latest iterations, such as GPT-4, Mixtral-8x22B, and LLaMA-3.1. We find that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short in alignment with the emotional behaviors of human beings and cannot establish connections between similar situations. Our collected dataset of situations, the human evaluation results, and the code of our testing framework, i.e., EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench.

Related papers

Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models [75.85319609088354]
Sentient Agent as a Judge (SAGE) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction. SAGE provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
arXiv Detail & Related papers (2025-05-01T19:06:10Z)
AI with Emotions: Exploring Emotional Expressions in Large Language Models [0.0]
Large Language Models (LLMs) role-play as agents answering questions with specified emotional states. Russell's Circumplex model characterizes emotions along the sleepy-activated (arousal) and pleasure-displeasure (valence) axes. Evaluation showed that the emotional states of the generated answers were consistent with the specifications.
arXiv Detail & Related papers (2025-04-20T18:49:25Z)
Do Large Language Models Possess Sensitive to Sentiment? [18.88126980975737]
Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality.
arXiv Detail & Related papers (2024-09-04T01:40:20Z)
Recognizing Emotion Regulation Strategies from Human Behavior with Large Language Models [44.015651538470856]
Human emotions are often not expressed directly, but regulated according to internal processes and social display rules. No method to automatically classify different emotion regulation strategies in a cross-user scenario exists. We make use of the recently introduced Deep corpus for modeling the social display of the emotion shame. A fine-tuned Llama2-7B model is able to classify the utilized emotion regulation strategy with high accuracy.
arXiv Detail & Related papers (2024-08-08T12:47:10Z)
Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. This paper presents a framework for investigating psychological dimensions in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation. We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning [55.127202990679976]
We introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. We propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders.
arXiv Detail & Related papers (2024-06-17T03:01:22Z)
Are Large Language Models More Empathetic than Humans? [14.18033127602866]
GPT-4 emerged as the most empathetic, marking an approximately 31% increase in responses rated as "Good" compared to the human benchmark. Some LLMs are significantly better at responding to specific emotions compared to others.
arXiv Detail & Related papers (2024-06-07T16:33:43Z)
Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all. We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models [47.890846082224066]
This work fills the gap by presenting CovidET-Appraisals, the most comprehensive dataset to date that assesses 24 appraisal dimensions. CovidET-Appraisals presents an ideal testbed to evaluate the ability of large language models to automatically assess and explain cognitive appraisals.
arXiv Detail & Related papers (2023-10-22T19:12:17Z)
Emotional Intelligence of Large Language Models [9.834823298632374]
Large Language Models (LLMs) have demonstrated remarkable abilities across numerous disciplines. However, their alignment with human emotions and values, which is critical for real-world applications, has not been systematically evaluated. Here, we assessed LLMs' Emotional Intelligence (EI), encompassing emotion recognition, interpretation, and understanding.
arXiv Detail & Related papers (2023-07-18T07:49:38Z)
Large Language Models Understand and Can be Enhanced by Emotional Stimuli [53.53886609012119]
We take the first step towards exploring the ability of Large Language Models to understand emotional stimuli. Our experiments show that LLMs have a grasp of emotional intelligence, and their performance can be improved with emotional prompts. Our human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks.
arXiv Detail & Related papers (2023-07-14T00:57:12Z)
