Fugu-MT Paper Translation (Summary): RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Paper summary: RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

arxiv url: http://arxiv.org/abs/2505.02064v2
Date: Tue, 06 May 2025 01:51:03 GMT
Status: Translation complete
Last updated in system: 2025-05-07 12:42:37.96075
Title: RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Title (reference translation): RTV-Bench: A Benchmark of MLLM Continuous Perception, Understanding, and Reasoning through Real-Time Video
Authors: Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu
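Abstract summary: RTV-Bench is a fine-grained benchmark for MLLM real-time video analysis. It contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs.
Reference score (independently computed attention score): 19.373906873461703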
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.
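Abstract (reference translation): Multimodal Large Language Models (MLLMs) excel at perception, understanding, and reasoning. However, current benchmarks do not adequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) a hierarchical question structure combining basic and advanced queries; and (3) multi-dimensional evaluation assessing continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. The experiments show that open-source real-time models largely outperform offline ones but still trail the top proprietary models. Larger model size or a higher frame sampling rate did not significantly improve RTV-Bench performance and sometimes caused slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences in order to advance real-time video analysis with MLLMs. The benchmark toolkit is available at https://github.com/LJungang/RTV-Bench.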
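Illustrative sketch (not from the paper): the abstract describes Multi-Timestamp Question Answering as asking a question whose correct answer changes as the scene changes. The Python below shows one way such an item could be represented and scored. The class names, field names, the `predict` callback signature, and the per-item accuracy metric are all assumptions made for illustration; they are not the data format or scoring code of the RTV-Bench toolkit.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class TimedQuery:
    """The same question, asked at one point in the video stream (hypothetical type)."""
    timestamp_s: float  # query time in seconds from the start of the video
    answer: str         # option key that is correct at that moment


@dataclass
class MTQAItem:
    """One multi-timestamp question with its options and timed ground truths (hypothetical type)."""
    question: str
    options: dict[str, str]
    queries: list[TimedQuery]


# Assumed model interface: (question, options, timestamp_s) -> chosen option key.
Predictor = Callable[[str, dict[str, str], float], str]


def score_item(item: MTQAItem, predict: Predictor) -> float:
    """Return the fraction of timestamps at which the prediction matches the ground truth."""
    if not item.queries:
        return 0.0
    correct = sum(
        predict(item.question, item.options, q.timestamp_s) == q.answer
        for q in item.queries
    )
    return correct / len(item.queries)


# Toy item: the correct answer changes as more people enter the scene.
item = MTQAItem(
    question="How many people are in the room?",
    options={"A": "one", "B": "two", "C": "three"},
    queries=[TimedQuery(10.0, "A"), TimedQuery(45.0, "B"), TimedQuery(90.0, "C")],
)


def static_model(question: str, options: dict[str, str], timestamp_s: float) -> str:
    """Stand-in for a model that never updates its answer."""
    return "A"


def toy_streaming_model(question: str, options: dict[str, str], timestamp_s: float) -> str:
    """Stand-in for a model that tracks the scene; the thresholds are made up."""
    if timestamp_s < 30:
        return "A"
    if timestamp_s < 60:
        return "B"
    return "C"


print(score_item(item, static_model))
print(score_item(item, toy_streaming_model))
```

In this toy setup, a model that answers once and never revises is right at only one of the three timestamps, whereas a model that follows the scene is right at all three. This is the kind of gap that a multi-timestamp design is meant to expose.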


Related papers