Fugu-MT Paper Translation (Abstract): Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Paper overview: Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

arxiv url: http://arxiv.org/abs/2605.17360v1
Date: Sun, 17 May 2026 09:57:01 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.921585
Title: Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction
Title (reference translation): Omni-DuplexEval: Evaluation of Real-Time Duplex Omni-modal Interaction
Authors: Chaoqun He, Mingyang Xiang, Yingjing Xu, Bokai Xu, Junbo Cui, Jie Zhou, Yuan Yao, Lijie Wen
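Abstract summary: Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios. Omni-DuplexEval is a benchmark for systematically evaluating real-time duplex interaction.
Reference score (independently computed attention score): 18.498258537382416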
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.
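Abstract (reference translation): Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, in which the entire video input is processed before any response is generated. Recent work has begun to explore real-time duplex MLLMs, but there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, with all questions formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which jointly evaluates response-content alignment and response timing through timestamp-aware and sequential reasoning and achieves strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall and only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.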
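Illustrative sketch (not the authors' method): the abstract says the evaluation jointly considers response content and response timing. The Python snippet below is a minimal, hypothetical illustration of that idea for the Proactive Reminder scenario. The names `Event`, `Response`, `content_matches`, and `score_reminders`, the 2-second tolerance, and the keyword check that stands in for the LLM judge are all assumptions made for this example; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float   # annotated timestamp of the salient event, in seconds
    label: str    # short description of what should be mentioned


@dataclass
class Response:
    time: float   # timestamp at which the model started responding
    text: str     # content of the model's response


def content_matches(response_text: str, label: str) -> bool:
    # Hypothetical stand-in for an LLM judge: case-insensitive keyword check.
    return label.lower() in response_text.lower()


def score_reminders(events, responses, tolerance=2.0):
    """Fraction of annotated events that receive a timely, on-topic response.

    A response counts only if it arrives at or after the event, within
    `tolerance` seconds (an assumed window), and its content matches.
    Each response can be credited to at most one event.
    """
    if not events:
        return 0.0
    used = set()
    hits = 0
    for event in sorted(events, key=lambda e: e.time):
        for i, resp in enumerate(responses):
            if i in used:
                continue
            timely = 0.0 <= resp.time - event.time <= tolerance
            if timely and content_matches(resp.text, event.label):
                used.add(i)
                hits += 1
                break
    return hits / len(events)


events = [Event(5.0, "kettle"), Event(12.0, "door")]
responses = [
    Response(5.8, "The kettle is boiling."),    # on time and on topic
    Response(20.0, "Someone is at the door."),  # on topic but far too late
]
score = score_reminders(events, responses)
print(score)
```

In this toy example only the first response is both timely and on topic, so one of the two events is covered. A real judge would need to handle paraphrases and partial credit, which a keyword check cannot do.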

Related papers