RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems
- URL: http://arxiv.org/abs/2511.22275v1
- Date: Thu, 27 Nov 2025 09:58:29 GMT
- Title: RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems
- Authors: Mengfan Li, Xuanhua Shi, Yang Deng
- Abstract summary: We propose RecToM, a novel benchmark for evaluating Theory of Mind in Large Language Models. RecToM focuses on two complementary dimensions: Cognitive Inference and Behavioral Prediction. Extensive experiments on state-of-the-art LLMs demonstrate that RecToM poses a significant challenge.
- Score: 23.229692182223157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models are revolutionizing conversational recommender systems through their impressive capabilities in instruction comprehension, reasoning, and human interaction. A core factor underlying effective recommendation dialogue is the ability to infer and reason about users' mental states (such as desire, intention, and belief), a cognitive capacity commonly referred to as Theory of Mind (ToM). Despite growing interest in evaluating ToM in LLMs, current benchmarks predominantly rely on synthetic narratives inspired by the Sally-Anne test, which emphasize physical perception and fail to capture the complexity of mental state inference in realistic conversational settings. Moreover, existing benchmarks often overlook a critical component of human ToM: behavioral prediction, the ability to use inferred mental states to guide strategic decision-making and select appropriate conversational actions for future interactions. To better align LLM-based ToM evaluation with human-like social reasoning, we propose RecToM, a novel benchmark for evaluating ToM abilities in recommendation dialogues. RecToM focuses on two complementary dimensions: Cognitive Inference and Behavioral Prediction. The former focuses on understanding what has been communicated by inferring the underlying mental states. The latter emphasizes what should be done next, evaluating whether LLMs can leverage these inferred mental states to predict, select, and assess appropriate dialogue strategies. Extensive experiments on state-of-the-art LLMs demonstrate that RecToM poses a significant challenge. While the models exhibit partial competence in recognizing mental states, they struggle to maintain coherent, strategic ToM reasoning throughout dynamic recommendation dialogues, particularly in tracking evolving intentions and aligning conversational strategies with inferred mental states.
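The two evaluation dimensions described above can be pictured as a simple scoring loop over probe questions. The sketch below is purely illustrative: the item fields, option format, and `evaluate` helper are assumptions for exposition, not the paper's actual data schema or harness.

```python
from dataclasses import dataclass

# Hypothetical benchmark item; all field names are illustrative, not from RecToM.
@dataclass
class RecToMItem:
    dialogue: str        # recommendation dialogue history
    question: str        # probe about the user's mental state or the next strategy
    options: list        # candidate answers
    answer: int          # index of the gold option
    dimension: str       # "cognitive_inference" or "behavioral_prediction"

def evaluate(items, predict):
    """Compute accuracy per dimension; `predict` maps an item to an option index."""
    correct, total = {}, {}
    for it in items:
        total[it.dimension] = total.get(it.dimension, 0) + 1
        if predict(it) == it.answer:
            correct[it.dimension] = correct.get(it.dimension, 0) + 1
    return {d: correct.get(d, 0) / total[d] for d in total}

items = [
    RecToMItem("User: I loved Inception. Bot: ...",
               "What does the user desire?",
               ["A mind-bending thriller", "A light comedy"], 0,
               "cognitive_inference"),
    RecToMItem("User: I'm not into horror anymore. Bot: ...",
               "Which strategy should the system take next?",
               ["Recommend another horror film", "Elicit updated preferences"], 1,
               "behavioral_prediction"),
]

# A trivial baseline that always picks option 0, standing in for an LLM call.
print(evaluate(items, lambda it: 0))
```

In a real harness, `predict` would prompt an LLM with the dialogue, question, and options; reporting accuracy separately per dimension is what lets the paper observe that models recognize mental states better than they act on them.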
Related papers
- A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction [50.05919688888947]
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation.
arXiv Detail & Related papers (2026-01-08T14:07:30Z) - Infusing Theory of Mind into Socially Intelligent LLM Agents [31.88529787413754]
Theory of Mind (ToM) is a key aspect of human social intelligence. We show that social agents that explicitly use ToM get better at dialogue, achieving goals more effectively. We introduce ToMAgent (ToMA), a ToM-focused dialogue agent.
arXiv Detail & Related papers (2025-09-26T20:07:34Z) - Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs [35.33577525791391]
This study moves beyond question generation to emphasize instructional guidance capability. We propose GuideEval, a benchmark grounded in authentic educational dialogues. We introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues.
arXiv Detail & Related papers (2025-08-08T01:02:44Z) - Theory of Mind in Large Language Models: Assessment and Enhancement [26.35781229730513]
Theory of Mind (ToM) - the ability to reason about the mental states of oneself and others - is a cornerstone of human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, understanding their ability to interpret and respond to human mental states is crucial for enabling effective interactions.
arXiv Detail & Related papers (2025-04-26T10:17:48Z) - PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues [27.231701486961917]
We propose PersuasiveToM, a benchmark designed to evaluate the Theory of Mind abilities of Large Language Models. Our framework contains two core tasks: ToM Reasoning and ToM Application. Our aim with PersuasiveToM is to allow an effective evaluation of the ToM reasoning ability of LLMs with more focus on complex psychological activities.
arXiv Detail & Related papers (2025-02-28T13:04:04Z) - Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection [31.38516078163367]
ToM-agent is designed to empower LLM-based generative agents to simulate ToM in open-domain conversational interactions. ToM-agent disentangles confidence from mental states, facilitating the emulation of an agent's perception of its counterpart's mental states. Our findings indicate that the ToM-agent can grasp the underlying reasons for its counterpart's behaviors beyond mere semantic-emotional support or decision-making based on common sense.
arXiv Detail & Related papers (2025-01-26T00:32:38Z) - Can LLMs Understand the Implication of Emphasized Sentences in Dialogue? [64.72966061510375]
Emphasis is a crucial component in human communication, which indicates the speaker's intention and implication beyond pure text in dialogue.
This paper introduces Emphasized-Talk, a benchmark with emphasis-annotated dialogue samples capturing the implications of emphasis.
We evaluate various Large Language Models (LLMs), both open-source and commercial, to measure their performance in understanding emphasis.
arXiv Detail & Related papers (2024-06-16T20:41:44Z) - NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding [55.38254464415964]
Theory of mind evaluations currently focus on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations.
We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation settings covering multi-dimensional mental states.
arXiv Detail & Related papers (2024-04-21T11:51:13Z) - Rational Sensibility: LLM Enhanced Empathetic Response Generation Guided by Self-presentation Theory [8.439724621886779]
The development of Large Language Models (LLMs) provides human-centered Artificial General Intelligence (AGI) with a glimmer of hope.
Empathy serves as a key emotional attribute of humanity, playing an irreplaceable role in human-centered AGI.
In this paper, we design an innovative encoder module inspired by self-presentation theory in sociology, which specifically processes sensibility and rationality sentences in dialogues.
arXiv Detail & Related papers (2023-12-14T07:38:12Z) - From Heuristic to Analytic: Cognitively Motivated Strategies for Coherent Physical Commonsense Reasoning [66.98861219674039]
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions.
Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z) - FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
The research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.