Fugu-MT Paper Translation (Abstract): Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Paper overview: Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

arxiv url: http://arxiv.org/abs/2606.05896v1
Date: Thu, 04 Jun 2026 09:03:43 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.67525
Title: Resonant Minds: Closed-Loop Social Avatars with Theory of Mind
Authors: Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang, Wentao Zhu
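Abstract summary: This paper proposes a closed-loop dual-agent framework that integrates perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes the partner's multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind. The expression module then generates emotion-controllable dual-agent videos that cover both the speaker's speech and expression and the listener's reactive behaviors.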
Reference score (attention score computed independently by this site): 16.880605576970538
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.
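The perception, social reasoning, and expression cycle described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation. Every name below (`Observation`, `MentalState`, `Expression`, `perceive`, `infer_mental_state`, `select_by_vote`, `express`, `run_dialogue`, `PROPOSERS`) is hypothetical. The rule-based functions stand in for the learned video-analysis, language-model, and video-generation components, and majority voting is only one possible reading of the "ensemble mechanism".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """What an agent perceives of its partner (hypothetical fields)."""
    utterance: str
    facial_emotion: str


@dataclass
class MentalState:
    """A guess at the partner's hidden state (hypothetical fields)."""
    emotion: str
    goal_guess: str


@dataclass
class Expression:
    """Stand-in for one generated dual-agent video segment."""
    speaker_text: str
    speaker_emotion: str
    listener_reaction: str


def perceive(partner_output: Expression) -> Observation:
    # Placeholder for multimodal video analysis: just read the fields.
    return Observation(partner_output.speaker_text, partner_output.speaker_emotion)


def infer_mental_state(obs: Observation) -> MentalState:
    # Toy Theory-of-Mind rule: guess a private goal from the visible emotion.
    goal = "seek reassurance" if obs.facial_emotion == "sad" else "keep chatting"
    return MentalState(obs.facial_emotion, goal)


def select_by_vote(candidates: List[str]) -> str:
    # Assumed ensemble rule: majority vote; ties go to the earliest candidate.
    return Counter(candidates).most_common(1)[0][0]


Proposer = Callable[[MentalState], str]


def reason(obs: Observation, proposers: List[Proposer]) -> Tuple[MentalState, str]:
    # Infer the hidden state, let each proposer suggest a reply, then vote.
    state = infer_mental_state(obs)
    return state, select_by_vote([propose(state) for propose in proposers])


def express(reply: str, partner_state: MentalState) -> Expression:
    # Placeholder for video generation: choose a speaker emotion and a
    # matching listener reaction instead of rendering frames.
    speaker_emotion = "calm" if partner_state.emotion == "sad" else "happy"
    listener_reaction = "nod" if speaker_emotion == "happy" else "attentive gaze"
    return Expression(reply, speaker_emotion, listener_reaction)


def run_dialogue(opening: Expression, proposers: List[Proposer], turns: int) -> List[Expression]:
    # Closed loop: each agent's generated output is the next agent's input.
    history = [opening]
    for _ in range(turns):
        obs = perceive(history[-1])
        state, reply = reason(obs, proposers)
        history.append(express(reply, state))
    return history


COMFORT = "That sounds hard. Want to talk about it?"
NEUTRAL = "Tell me more."

# Three toy proposers standing in for an ensemble of response generators.
PROPOSERS: List[Proposer] = [
    lambda s: COMFORT if s.emotion == "sad" else NEUTRAL,
    lambda s: COMFORT if s.goal_guess == "seek reassurance" else NEUTRAL,
    lambda s: "I see.",
]
```

For instance, `run_dialogue(Expression("I failed my exam.", "sad", "none"), PROPOSERS, turns=2)` returns three segments. The second carries the comforting reply, which two of the three toy proposers voted for.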

Related papers