Fugu-MT Paper Translation (Abstract): Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)

Paper summary: Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)

arxiv url: http://arxiv.org/abs/2509.13161v1
Date: Tue, 16 Sep 2025 15:13:21 GMT
Status: Translation complete
Last updated in system: 2025-09-17 17:50:53.14448
Title: Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)
Title (reference translation): Enhancing Video Large Language Models through Structured Multi-Video Collaborative Reasoning (early version)
Authors: Zhihao He, Tianyao He, Tieyuan Chen, Yun Xu, Huabin Liu, Chaofan Gan, Gui Zou, Weiyao Lin
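Abstract summary: A promising solution is to improve reasoning performance by using multiple related videos. Video tokens are numerous and contain redundant information. The authors propose a multi-video collaborative framework for video language models.
Reference score (independently computed attention score): 18.484276267960436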
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the prosperity of the video language model, the current pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video's knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models.
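Abstract (reference translation): Despite the flourishing of video language models, the current pursuit of comprehensive video reasoning is hindered by the spatio-temporal incompleteness inherent in individual videos, which leads to hallucinations and inaccurate results. A promising solution is to improve reasoning performance by using multiple related videos. However, video tokens are numerous and contain redundant information, so feeding related video data directly into a large language model to improve its responses may be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we build a Video Structuring Module that represents a video's knowledge as a spatio-temporal graph. Based on this structured video representation, we design a Graph Fusion Module that fuses structured knowledge and valuable information from related videos into augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt that integrates graph, visual, and textual tokens as the input to the large language model. Extensive experiments demonstrate the effectiveness of our framework and show its potential as a promising path for advancing video language models.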
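The abstract describes representing each video as a spatio-temporal graph and fusing information from related videos into augmented node tokens. The following is only a toy sketch of that general idea, not the authors' implementation: the `Node` and `VideoGraph` classes, the `fuse` function, the cosine-similarity matching, the averaging step, and all labels and feature values are assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str      # hypothetical entity label, e.g. "person"
    frame: int      # frame index where the entity appears
    feature: list   # toy feature vector of floats


@dataclass
class VideoGraph:
    nodes: list = field(default_factory=list)
    # (source index, destination index, relation name); assumed edge format
    edges: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse(target, related_graphs, top_k=1):
    """Average each target node feature with its top_k most similar
    nodes taken from the related graphs (an assumed fusion rule)."""
    pool = [n for g in related_graphs for n in g.nodes]
    fused = []
    for node in target.nodes:
        ranked = sorted(
            pool, key=lambda n: cosine(node.feature, n.feature), reverse=True
        )
        picked = [node.feature] + [n.feature for n in ranked[:top_k]]
        fused.append([sum(col) / len(picked) for col in zip(*picked)])
    return fused


# Toy target video graph and one related video graph.
target = VideoGraph(
    nodes=[Node("person", 0, [1.0, 0.0]), Node("cup", 5, [0.0, 1.0])],
    edges=[(0, 1, "picks_up")],
)
related = VideoGraph(
    nodes=[Node("person", 2, [1.0, 0.0]), Node("cup", 7, [0.0, 3.0])]
)

augmented = fuse(target, [related])
print(augmented)  # [[1.0, 0.0], [0.0, 2.0]]
```

In this sketch the augmented features stay one per target node, so the number of graph tokens passed on does not grow with the number of related videos. That is one plausible reading of why a structured representation helps with token redundancy, though the paper's actual mechanism may differ.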

Related papers