Fugu-MT paper translation (abstract): From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

Paper overview: From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.09092v1
Date: Mon, 08 Jun 2026 06:42:12 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.769469
Title: From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning
Title (reference translation): From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning
Authors: Jike Zhong, Yuxiang Lai, Ming Li, Yuheng Li, Wuao Liu, Behzad Dariush, Konstantinos Psounis, Shao-Yuan Lo
Abstract summary: Theory of Mind (ToM) is an essential skill for modern foundation model systems. Recent work has explored honing ToM through post-training. We show that such progress is confounded by a pervasive "shortcut" problem.
Reference score (independently computed attention score): 27.941974053779745
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Theory of Mind (ToM) is a must-acquire skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is confounded by a pervasive "shortcut" issue: tasks can reach up to 99% accuracy by simply exploiting spurious causal correlations, leading to a false sense of ToM. Motivated by this, we first develop a framework to systematically examine ToM datasets for shortcuts and provide guidance for future development. We find that questions reducible to pure state tracking, such as "belief," are especially shortcut-prone compared to mind questions, such as "intention," where reasoning beyond tracking is required. Using four shortcut-free datasets across three ToM contexts, we then comprehensively study whether Reinforcement Fine-Tuning with verifiable rewards and explicit reasoning chains, called Thinking-RFT, elevates ToM beyond Supervised Fine-Tuning, or SFT. Our key findings are as follows. First, Thinking-RFT effectively improves ToM in all scenarios, with a 6% improvement over SFT, particularly in complex higher-order reasoning, with a 10% improvement over SFT, and multimodal cases, with a 7% improvement over SFT. It also generalizes notably better to unseen domains and higher-order queries while being more robust to counterfactuals. Second, ToM benefits specifically from the joint effect of reasoning and RL: Thinking-RFT outperforms Non-Thinking-RFT by 7% on average. Third, RFT works by learning to ground its reasoning on anchor cues, such as keywords and state changes, that correspond to causal factors. We believe our study is useful for developing effective and robust ToM post-training datasets and advancing critical ToM capabilities.
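Abstract (reference translation): Theory of Mind (ToM) is a skill that modern foundation model systems need in order to operate effectively and safely in the real world. Recent work has explored honing ToM via post-training, but we show that such progress is confounded by a pervasive "shortcut" problem: tasks can reach up to 99% accuracy by exploiting spurious causal correlations, giving a false sense of ToM. We therefore first develop a framework that systematically examines ToM datasets for shortcuts and provides guidance for future development. We find that questions reducible to pure state tracking, such as "belief," are especially shortcut-prone compared with mind questions, such as "intention," which require reasoning beyond tracking. Using four shortcut-free datasets across three ToM contexts, we comprehensively study whether Reinforcement Fine-Tuning with verifiable rewards and explicit reasoning chains, called Thinking-RFT, elevates ToM beyond Supervised Fine-Tuning (SFT). The main findings are as follows. First, Thinking-RFT effectively improves ToM in all scenarios, with a 6% improvement over SFT, particularly in complex higher-order reasoning (10% over SFT) and multimodal cases (7% over SFT). It also generalizes notably better to unseen domains and higher-order queries and is more robust to counterfactuals. Second, ToM benefits specifically from the joint effect of reasoning and RL: Thinking-RFT outperforms Non-Thinking-RFT by 7% on average. Third, RFT works by learning to ground its reasoning on anchor cues, such as keywords and state changes, that correspond to causal factors. We believe our study is useful for developing effective and robust ToM post-training datasets and for advancing critical ToM capabilities.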

Related papers