Fugu-MT Paper Translation (Abstract): Video Panels for Long Video Understanding

Paper overview: Video Panels for Long Video Understanding

arxiv url: http://arxiv.org/abs/2509.23724v1
Date: Sun, 28 Sep 2025 08:05:55 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.401172
Title: Video Panels for Long Video Understanding
Authors: Lars Doorenbos, Federico Spurio, Juergen Gall
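Abstract summary: This paper proposes a visual prompting strategy designed specifically for long-video understanding. By combining multiple frames as panels into one image, it trades spatial detail for temporal resolution. The approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing video-language models.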
Reference score (independently computed attention score): 25.560912635941662
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long-context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.
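The core idea in the abstract, combining several frames as panels into one image, can be illustrated with a minimal sketch. This is not the authors' implementation (their code is not yet released). The function names `make_panel_image` and `video_to_panels`, the fixed rows-by-columns grid, the row-major frame order, and the nearest-neighbour downscaling are all assumptions made for illustration.

```python
import numpy as np


def make_panel_image(frames, rows, cols):
    """Tile rows*cols frames (each H x W x C) into one image of roughly H x W.

    Hypothetical helper: the grid layout and the nearest-neighbour
    downscaling are illustrative assumptions, not the paper's method.
    """
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    h, w, c = frames[0].shape
    ph, pw = h // rows, w // cols  # size of each panel after downscaling
    canvas = np.zeros((ph * rows, pw * cols, c), dtype=frames[0].dtype)
    # Source pixel indices sampled for nearest-neighbour downscaling.
    ys = np.arange(ph) * h // ph
    xs = np.arange(pw) * w // pw
    for i, frame in enumerate(frames):
        small = frame[ys][:, xs]   # shape (ph, pw, C)
        r, col = divmod(i, cols)   # row-major placement keeps temporal order
        canvas[r * ph:(r + 1) * ph, col * pw:(col + 1) * pw] = small
    return canvas


def video_to_panels(frames, rows, cols):
    """Group consecutive frames into panel images; leftover frames are dropped."""
    per_image = rows * cols
    usable = len(frames) - len(frames) % per_image
    return [
        make_panel_image(frames[i:i + per_image], rows, cols)
        for i in range(0, usable, per_image)
    ]


# Eight dummy 4x4 RGB frames, each filled with its own frame index.
frames = [np.full((4, 4, 3), i, dtype=np.uint8) for i in range(8)]
panels = video_to_panels(frames, rows=2, cols=2)
print(len(panels), panels[0].shape)  # 2 (4, 4, 3)
```

With a 2 x 2 grid, each image passed to the model covers four frames at half the resolution per axis, so the same image budget spans four times as many time steps. This is the trade of spatial detail for temporal resolution that the abstract describes.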
