ToM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLMs
- URL: http://arxiv.org/abs/2504.01698v1
- Date: Wed, 02 Apr 2025 12:58:42 GMT
- Title: ToM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLMs
- Authors: Yi-Long Lu, Chunhui Zhang, Jiajun Song, Lifeng Fan, Wei Wang
- Abstract summary: We show that rule-based reinforcement learning can unlock Theory of Mind (ToM) reasoning capabilities even in small-scale language models. Our RL-trained 7B model achieves 84.50% accuracy on the Hi-ToM benchmark, surpassing models like GPT-4o and DeepSeek-v3. These findings highlight RL's potential to enhance social cognitive reasoning, bridging the gap between structured problem-solving and nuanced social inference.
- Score: 14.29992535286614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in rule-based reinforcement learning (RL), applied during the post-training phase of large language models (LLMs), have significantly enhanced their capabilities in structured reasoning tasks such as mathematics and logical inference. However, the effectiveness of RL in social reasoning, particularly in Theory of Mind (ToM), the ability to infer others' mental states, remains largely unexplored. In this study, we demonstrate that RL methods effectively unlock ToM reasoning capabilities even in small-scale LLMs (0.5B to 7B parameters). Using a modest dataset comprising 3200 questions across diverse scenarios, our RL-trained 7B model achieves 84.50% accuracy on the Hi-ToM benchmark, surpassing models like GPT-4o and DeepSeek-v3 despite significantly fewer parameters. While smaller models (≤3B parameters) suffer from reasoning collapse, larger models (7B parameters) maintain stable performance through consistent belief tracking. Additionally, our RL-based models demonstrate robust generalization to higher-order, out-of-distribution ToM problems, novel textual presentations, and previously unseen datasets. These findings highlight RL's potential to enhance social cognitive reasoning, bridging the gap between structured problem-solving and nuanced social inference in LLMs.
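The rule-based RL described here rewards verifiably correct final answers. A minimal sketch of what such a reward function could look like for a ToM question, assuming answers are wrapped in an answer tag and checked by exact match; the tag convention and score values are illustrative assumptions, not the paper's exact recipe:

```python
import re

def tom_rule_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward for a ToM question: a format check plus an exact-match answer check.

    Assumes the model is prompted to place its final answer inside <answer>...</answer>;
    this convention and the 0.1 / 0.9 split are illustrative, not the paper's recipe.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parseable answer, no reward
    predicted = match.group(1).strip().lower()
    gold = gold_answer.strip().lower()
    # Small reward for following the format, larger reward for a correct answer.
    return 0.1 + (0.9 if predicted == gold else 0.0)

# Example: a false-belief question whose gold answer is "basket".
completion = "Sally thinks the marble is in the basket. <answer>basket</answer>"
print(tom_rule_reward(completion, "basket"))  # 1.0
```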
Related papers
- Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [135.1260782461186]
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs).
However, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity.
We present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward.
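As a rough sketch, the four-stage recipe above could be organized as a sequential training plan; the stage descriptions follow the abstract, while the structure and naming below are illustrative assumptions rather than the paper's configuration:

```python
# Hypothetical outline of the four-stage SLM training recipe described above;
# stage names follow the abstract, data sources and ordering details are assumptions.
TRAINING_STAGES = [
    {"stage": "mid_training",  "objective": "next-token loss on large-scale distilled long-CoT data"},
    {"stage": "sft",           "objective": "supervised fine-tuning on high-quality long-CoT data"},
    {"stage": "rollout_dpo",   "objective": "DPO on a carefully curated rollout preference dataset"},
    {"stage": "rl_verifiable", "objective": "reinforcement learning with verifiable (rule-checked) rewards"},
]

for i, cfg in enumerate(TRAINING_STAGES, start=1):
    print(f"Stage {i}: {cfg['stage']} -> {cfg['objective']}")
```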
arXiv Detail & Related papers (2025-04-30T00:04:35Z) - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs.
We re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models.
We find that RL does not, in fact, elicit fundamentally new reasoning patterns.
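The boundary analysis here relies on pass@k with large k. The standard unbiased estimator (popularized by the HumanEval evaluation methodology) computes, for n sampled generations of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations is correct, given c of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 samples with 8 correct; large k recovers rare successes.
print(round(pass_at_k(256, 8, 1), 3))    # 0.031
print(round(pass_at_k(256, 8, 128), 3))  # ~0.997
```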
arXiv Detail & Related papers (2025-04-18T17:59:56Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.
We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs).
We show that SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models.
We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z) - Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models [53.4530106173067]
Large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks.
RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively.
This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge.
arXiv Detail & Related papers (2025-04-03T04:46:17Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking [39.48406368755411]
Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs). RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting necessary to enhance the reasoning capability of RLLMs?" We present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-25T12:37:22Z) - R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z) - Towards Reasoning Ability of Small Language Models [3.732224317444325]
We show that small language models (SLMs) can achieve competitive reasoning performance. We systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning.
arXiv Detail & Related papers (2025-02-17T08:59:16Z) - On the Emergence of Thinking in LLMs I: Searching for the Right Intuition [34.32871896067864]
We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: supervised fine-tuning with human or synthetic demonstrations of the reasoning process, using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and RL training with an outcome verifier to ensure correctness while preventing reward hacking. Empirical studies in the math domain show that RLSP improves reasoning.
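A rough sketch of how the exploration reward and outcome verifier described above could be combined into one scalar training signal; the diversity proxy and the weighting below are stand-ins invented for illustration, not the paper's formulation:

```python
def rlsp_style_reward(final_answer: str, gold_answer: str,
                      reasoning_trace: str, exploration_weight: float = 0.1) -> float:
    """Combine a verifiable outcome reward with an exploration bonus.
    The bonus (favoring non-repetitive reasoning) is a crude stand-in for the
    paper's exploration signal, which the abstract does not specify in detail."""
    outcome = 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
    tokens = reasoning_trace.split()
    diversity = len(set(tokens)) / max(len(tokens), 1)  # anti-repetition proxy in [0, 1]
    return outcome + exploration_weight * diversity

print(rlsp_style_reward("42", "42", "First compute 6 * 7, which gives 42."))  # -> 1.1
```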
arXiv Detail & Related papers (2025-02-10T18:52:04Z) - Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
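As a compact illustration of searching over story-action sequences for adversarially hard ToM scenarios, here is a greedy best-first stand-in for the A* search the abstract mentions; the action set and the difficulty heuristic are invented for illustration and are not ExploreToM's actual domain-specific language:

```python
import heapq

# Toy best-first search over story-action sequences; actions and the "difficulty"
# heuristic are illustrative stand-ins, not ExploreToM's DSL or scoring.
ACTIONS = ["enter_room", "leave_room", "move_object", "peek", "tell_false_info"]

def difficulty(story):
    """Toy heuristic: actions that change or corrupt beliefs make a story harder."""
    return sum(a in ("move_object", "tell_false_info") for a in story)

def search_hard_story(max_len=5, target_difficulty=3):
    frontier = [(0, [])]  # (negative difficulty, partial story); hardest expanded first
    while frontier:
        _, story = heapq.heappop(frontier)
        if difficulty(story) >= target_difficulty:
            return story
        if len(story) >= max_len:
            continue
        for action in ACTIONS:
            new_story = story + [action]
            heapq.heappush(frontier, (-difficulty(new_story), new_story))
    return None

print(search_hard_story())  # a short story dominated by belief-changing actions
```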
arXiv Detail & Related papers (2024-12-12T21:29:00Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - A Survey on Model-based Reinforcement Learning [21.85904195671014]
Reinforcement learning (RL) solves sequential decision-making problems via a trial-and-error process interacting with the environment.
Model-based reinforcement learning (MBRL) is believed to be a promising direction, which builds environment models in which the trial-and-errors can take place without real costs.
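The core idea the survey covers, learning an environment model so that trial and error can happen inside it cheaply, can be shown on a toy chain MDP; this is a generic sketch, not a specific algorithm from the survey:

```python
import random
from collections import defaultdict

# Toy model-based RL: collect real transitions, fit a tabular model, plan in the model.
N_STATES, ACTIONS, GAMMA = 6, (-1, +1), 0.9

def step(s, a):
    """Real environment: move along a chain; landing on the last state pays +1."""
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

# 1) Collect real transitions with a random exploration policy.
counts, rewards = defaultdict(lambda: defaultdict(int)), {}
for _ in range(2000):
    s, a = random.randrange(N_STATES), random.choice(ACTIONS)
    s2, r = step(s, a)
    counts[(s, a)][s2] += 1
    rewards[(s, a, s2)] = r

# 2) Fit a tabular dynamics model P(s' | s, a) from the counts.
def model(s, a):
    total = sum(counts[(s, a)].values())
    return {s2: n / total for s2, n in counts[(s, a)].items()}

# 3) Plan inside the learned model with value iteration (no further real-env cost).
V = [0.0] * N_STATES
for _ in range(100):
    V = [max(sum(p * (rewards[(s, a, s2)] + GAMMA * V[s2])
                 for s2, p in model(s, a).items())
             for a in ACTIONS)
         for s in range(N_STATES)]

print([round(v, 2) for v in V])  # values increase toward the rewarding end of the chain
```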
arXiv Detail & Related papers (2022-06-19T05:28:03Z) - Stock Trading Optimization through Model-based Reinforcement Learning with Resistance Support Relative Strength [4.322320095367326]
We design a new approach that leverages resistance and support (RS) level as regularization terms for action in model-based reinforcement learning (MBRL) algorithms.
Our proposed method also withstands large drops (smaller maximum drawdown) during the COVID-19 pandemic period, when the financial market underwent an unpredictable crisis.
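A rough sketch of how resistance and support levels might enter as an action regularization term, as the summary describes; the penalty shape, the 2% proximity band, and the coefficient are assumptions for illustration, not the paper's formulation:

```python
def rs_regularized_reward(base_reward, price, position_change,
                          support, resistance, penalty_coef=0.5):
    """Penalize buying near resistance and selling near support, as a stand-in
    for RS-based action regularization; position_change > 0 buys, < 0 sells."""
    band = 0.02 * price  # within 2% of a level counts as "near" (arbitrary choice)
    penalty = 0.0
    if position_change > 0 and abs(price - resistance) < band:
        penalty = penalty_coef * position_change
    elif position_change < 0 and abs(price - support) < band:
        penalty = penalty_coef * (-position_change)
    return base_reward - penalty

print(rs_regularized_reward(1.0, price=99.0, position_change=2.0,
                            support=80.0, resistance=100.0))  # 0.0: buy near resistance is penalized
```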
arXiv Detail & Related papers (2022-05-30T12:36:48Z)