Understanding Social Reasoning in Language Models with Language Models
- URL: http://arxiv.org/abs/2306.15448v2
- Date: Mon, 4 Dec 2023 22:31:26 GMT
- Title: Understanding Social Reasoning in Language Models with Language Models
- Authors: Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman
- Abstract summary: We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
- Score: 34.068368860882586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) become increasingly integrated into our
everyday lives, understanding their ability to comprehend human mental states
becomes critical for ensuring effective interactions. However, despite the
recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of
LLMs, the degree to which these models can align with human ToM remains a
nuanced topic of exploration. This is primarily due to two distinct challenges:
(1) the presence of inconsistent results from previous evaluations, and (2)
concerns surrounding the validity of existing evaluation methodologies. To
address these challenges, we present a novel framework for procedurally
generating evaluations with LLMs by populating causal templates. Using our
framework, we create a new social reasoning benchmark (BigToM) for LLMs which
consists of 25 controls and 5,000 model-written evaluations. We find that human
participants rate the quality of our benchmark higher than previous
crowd-sourced evaluations and comparable to expert-written evaluations. Using
BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and
compare model performances with human performance. Our results suggest that
GPT-4 has ToM capabilities that mirror human inference patterns, though less
reliably, while other LLMs struggle.
Related papers
- Evaluating the Performance of Large Language Models via Debates [43.40134389150456]
Large Language Models (LLMs) are rapidly evolving and impacting various fields.
Most current approaches to performance evaluation are either based on fixed, domain-specific questions or rely on human input.
We propose an automated benchmarking framework based on debates between LLMs, judged by another LLM.
This method assesses not only domain knowledge, but also skills such as argumentative reasoning and inconsistency recognition.
arXiv Detail & Related papers (2024-06-16T19:02:31Z)
- ToMBench: Benchmarking Theory of Mind in Large Language Models [41.565202027904476]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Post Turing: Mapping the landscape of LLM Evaluation [22.517544562890663]
This paper traces the historical trajectory of Large Language Models (LLMs) evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research.
We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models.
This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
arXiv Detail & Related papers (2023-11-03T17:24:50Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans.
This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning).
We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, covering GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that these LLMs respond consistently to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.