Fugu-MT Paper Translation (Abstract): SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Paper summary: SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

arxiv url: http://arxiv.org/abs/2603.16859v1
Date: Tue, 17 Mar 2026 17:58:44 GMT
Status: Translation complete
System update date: 2026-03-18 17:42:07.470637
Title: SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
Title (reference translation): SocialOmni: A Benchmark of Audio-Visual Social Interaction in Omni Models
Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji
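Abstract summary: SocialOmni is a benchmark that operationalizes the evaluation of conversational interactivity across three core dimensions. It features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances. The analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions.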
Reference score (independently computed attention metric): 86.19617358080016
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmarked 12 leading OLMs, which uncovers significant variance in their social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
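Abstract (reference translation): Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of interactivity across three dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints. We benchmark 12 leading OLMs, revealing large differences in social-interaction capability across models. Furthermore, the analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, suggesting that understanding-centric metrics alone are insufficient to characterize conversational social competence. Encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.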

Related papers