Related papers: Towards Safety Evaluations of Theory of Mind in Large Language Models

Towards Safety Evaluations of Theory of Mind in Large Language Models

URL: http://arxiv.org/abs/2506.17352v2
Date: Wed, 02 Jul 2025 00:16:28 GMT
Title: Towards Safety Evaluations of Theory of Mind in Large Language Models
Authors: Tatsuhiro Aoshima, Mitsuaki Akiyama,
Abstract summary: Large language models (LLMs) exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner.<n>To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to measure the theory of mind capabilities of LLMs.<n>Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development.
Score: 5.431189652149939
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin by reviewing existing research on theory of mind and identifying the perspectives and tasks relevant to its application in safety evaluation. Given that theory of mind has been predominantly studied within the context of developmental psychology, we analyze developmental trends across a series of open-weight LLMs. Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development. Finally, we present the current state of safety evaluation with respect to LLMs' theory of mind, and discuss remaining challenges for future work.

Related papers

The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs [42.57873562187369]
Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP)<n>LLMs have occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios.<n>This survey aims to provide a comprehensive and systematic overview of recent advancements in LLMs safety evaluation.
arXiv Detail & Related papers (2025-06-06T05:50:50Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.<n>We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning [40.55486479495965]
Large Language Models (LLMs) have demonstrated remarkable success across various NLP benchmarks.<n>In this work, we investigate the interplay between reasoning and safety in LLMs.<n>We highlight the latent safety risks that arise as reasoning capabilities improve, shedding light on previously overlooked vulnerabilities.
arXiv Detail & Related papers (2025-02-13T06:37:28Z)
Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models [0.0]
Recent advances in Large Language Models have incorporated planning and reasoning capabilities.<n>This has reduced errors in mathematical and logical tasks while improving accuracy.<n>Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to OpenAI's o1.
arXiv Detail & Related papers (2025-01-27T21:26:37Z)
Current state of LLM Risks and AI Guardrails [0.0]
Large language models (LLMs) have become increasingly sophisticated, leading to widespread deployment in sensitive applications where safety and reliability are paramount. These risks necessitate the development of "guardrails" to align LLMs with desired behaviors and mitigate potential harm. This work explores the risks associated with deploying LLMs and evaluates current approaches to implementing guardrails and model alignment techniques.
arXiv Detail & Related papers (2024-06-16T22:04:10Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks.<n>In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention.<n>We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types.<n>These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
Evaluating Large Language Models: A Comprehensive Survey [41.64914110226901]
Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation.
arXiv Detail & Related papers (2023-10-30T17:00:52Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry. This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes. To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.