Fugu-MT paper translation (abstract): QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models

Paper overview: QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models

arxiv url: http://arxiv.org/abs/2601.06573v1
Date: Sat, 10 Jan 2026 13:42:15 GMT
Status: Translation complete
Last updated in system: 2026-01-13 19:08:00.883235
Title: QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models
Title (reference translation): QMAVIS: Long Video-Audio Understanding through Fusion of Large Multimodal Models
Authors: Zixing Lin, Jiale Wang, Gee Wah Ng, Lee Onn Mak, Chan Zhi Yang Jeriel, Jun Yang Lee, Yaohao Li
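Abstract summary: QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking) is a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for longer videos of a few minutes to beyond an hour long, opening up new potential applications in sensemaking, video content analysis, embodied AI, etc.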
Reference score (independently computed attention score): 5.182512564299702
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on shorter videos of a few minutes long. In this paper, we introduce QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for longer videos of a few minutes to beyond an hour long, opening up new potential applications in sensemaking, video content analysis, embodied AI, etc. Quantitative experiments using QMAVIS demonstrated a 38.75% improvement over state-of-the-art video-audio LMMs like VideoLlaMA2 and InternVL2 on the VideoMME (with subtitles) dataset, which comprises long videos with audio information. Evaluations on other challenging video understanding datasets like PerceptionTest and EgoSchema saw up to 2% improvement, indicating competitive performance. Qualitative experiments also showed that QMAVIS is able to extract the nuances of different scenes in a long video audio content while understanding the overarching narrative. Ablation studies were also conducted to ascertain the impact of each component in the fusion pipeline.
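Abstract (reference translation): Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on short videos a few minutes long. This paper introduces QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for videos from a few minutes to beyond an hour long, opening up new potential applications in sensemaking, video content analysis, embodied AI, and more. In quantitative experiments, QMAVIS achieved a 38.75% improvement over state-of-the-art video-audio LMMs such as VideoLlaMA2 and InternVL2 on the VideoMME (with subtitles) dataset. Evaluations on other challenging video understanding datasets such as PerceptionTest and EgoSchema showed up to 2% improvement, indicating competitive performance. Qualitative experiments showed that QMAVIS can extract the nuances of different scenes in long video-audio content while understanding the overarching narrative. Ablation studies were also conducted to ascertain the impact of each component in the fusion pipeline.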
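
The abstract names late fusion of an LMM, a speech recognition model, and an LLM but gives no implementation detail. The following Python sketch shows one way such a late fusion could be wired together: each model runs independently and only their text outputs are combined. Everything here is an assumption for illustration, not the authors' pipeline: the fixed-length segmentation, the `Segment` type, the prompt layout, and the three callables (`describe_video`, `transcribe_audio`, `answer_with_llm`) are hypothetical placeholders for real models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One time window of a long video (hypothetical helper type)."""
    start_s: float
    end_s: float


def split_into_segments(duration_s: float, window_s: float) -> List[Segment]:
    """Cut [0, duration_s) into consecutive windows of at most window_s seconds.

    Fixed-length windowing is an assumption; the abstract does not say how
    (or whether) the video is segmented.
    """
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        segments.append(Segment(start, end))
        start = end
    return segments


def late_fusion_answer(
    duration_s: float,
    window_s: float,
    question: str,
    describe_video: Callable[[Segment], str],    # stands in for a video LMM
    transcribe_audio: Callable[[Segment], str],  # stands in for a speech recognizer
    answer_with_llm: Callable[[str], str],       # stands in for a text-only LLM
) -> str:
    """Run each model separately, then fuse their text outputs in one prompt."""
    lines = []
    for seg in split_into_segments(duration_s, window_s):
        visual = describe_video(seg)    # visual branch, independent of audio
        speech = transcribe_audio(seg)  # audio branch, independent of video
        lines.append(
            f"[{seg.start_s:.0f}s-{seg.end_s:.0f}s] VIDEO: {visual} | SPEECH: {speech}"
        )
    # Fusion happens late: only after both branches have produced text.
    prompt = "\n".join(lines) + f"\nQUESTION: {question}"
    return answer_with_llm(prompt)
```

In this sketch the fusion step is plain string concatenation, which keeps the branches swappable: replacing any callable with a real model wrapper leaves the rest unchanged. How the paper's pipeline actually merges the branches is not stated in the abstract.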

Related papers