Fugu-MT Paper Translation (Abstract): VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Paper overview: VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

arxiv url: http://arxiv.org/abs/2605.30256v1
Date: Thu, 28 May 2026 17:20:01 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:56.581868
Title: VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Title (reference translation): VideoFDB: Evaluation of Full-Duplex Vision-Speech Capabilities in Conversational Agents
Authors: Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello
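Abstract summary: To support successful human-agent interaction, agents must model full-duplex audiovisual conversation. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and progress.
Reference score (independently computed attention score): 27.786319380559515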
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.
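Abstract (reference translation): People simultaneously speak and listen while continuously interpreting and producing nonverbal cues such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we propose VideoFDB, the first benchmark to evaluate full-duplex AV2AV conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we observe systematic failure modes, namely captioning collapse and visual-stream ignorance, and show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation, and we hope it will accelerate the advancement and development of next-generation multimodal conversational agents.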

Related papers