Fugu-MT Paper Translation (Abstract): Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

Paper abstract: Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

arxiv url: http://arxiv.org/abs/2606.15152v1
Date: Sat, 13 Jun 2026 06:44:31 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:32.945004
Title: Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation
Title (reference translation): Can Agents Read the Room? Evaluating Visual Social Intelligence in Multimodal Simulation
Authors: Shijun Wan, Xuehai Wu, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
Abstract summary: Existing social-agent benchmarks are largely text-based and rarely examine whether multimodal agents can use visual cues to guide interaction. We introduce \textsc{\benchmarkname{}}, a benchmark evaluating visual social intelligence in multimodal social simulation.
Reference score (independently computed attention score): 38.36111181883569
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce \textsc{\benchmarkname{}}, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.
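Abstract (reference translation): Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. However, existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce \textsc{\benchmarkname{}}, a benchmark that evaluates visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, and combines aligned textual-visual evidence, structured role profiles, and four role-level tasks (expression task, characteristic task, interaction regulation task, and interaction outcome task). An evaluation of seven MLLMs under verbalized-vision and direct-vision reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain considerably more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.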
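The counts in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that every role instance is paired with each of the four role-level tasks exactly once; the abstract does not state this, but the reported totals are consistent with it. The constant and function names are illustrative and are not taken from the AgentViSS code or dataset.

```python
# Counts as reported in the abstract.
NUM_SCENARIOS = 240
NUM_ROLE_INSTANCES = 585
NUM_ROLE_TASK_INSTANCES = 2340

# The four role-level tasks named in the abstract.
TASKS = (
    "expression",
    "characteristic",
    "interaction regulation",
    "interaction outcome",
)


def role_task_instances(num_roles: int, num_tasks: int) -> int:
    """Assumption: each role instance gets every task exactly once."""
    return num_roles * num_tasks


expected = role_task_instances(NUM_ROLE_INSTANCES, len(TASKS))
avg_roles_per_scenario = NUM_ROLE_INSTANCES / NUM_SCENARIOS

print(expected == NUM_ROLE_TASK_INSTANCES)  # True: 585 * 4 = 2340
print(avg_roles_per_scenario)               # 2.4375
```

Under that assumption, the 2,340 role-task instances are exactly 585 roles times 4 tasks, and a scenario has about 2.4 roles on average.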


Related papers