Fugu-MT Paper Translation (Summary): Dense Video Understanding with Gated Residual Tokenization

Paper summary: Dense Video Understanding with Gated Residual Tokenization

arxiv url: http://arxiv.org/abs/2509.14199v2
Date: Thu, 18 Sep 2025 13:17:10 GMT
Status: Translation complete
Last updated in system: 2025-09-19 13:12:58.960615
Title: Dense Video Understanding with Gated Residual Tokenization
Title (reference translation): Dense Video Understanding Using Gated Residual Tokenization
Authors: Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu
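Abstract summary: High temporal resolution is essential for capturing fine-grained details in video understanding. Current benchmarks rely mainly on low-frame-rate sampling. Dense Video Understanding (DVU) enables high-FPS video comprehension by reducing both tokenization time and token overhead.
Reference score (independently computed attention metric): 49.17263029080152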
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
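As a rough illustration of the gating idea in stage (1), the sketch below skips patches whose pixels barely change between consecutive frames, so that only changed patches would be passed to a tokenizer. It is not the authors' method: plain mean absolute frame differencing stands in for the paper's motion-compensated estimation, and the names `gate_patches`, `count_gated_tokens`, `patch`, and `threshold` are hypothetical, chosen only for this example.

```python
import numpy as np


def gate_patches(prev_frame, frame, patch=16, threshold=2.0):
    """Return a boolean (rows, cols) mask marking patches that changed.

    Assumption: mean absolute pixel difference per patch is a stand-in for
    motion estimation; the paper's actual gating rule is not reproduced here.
    """
    h, w = frame.shape[:2]
    rows, cols = h // patch, w // patch
    # Crop to a whole number of patches and compute the per-pixel change.
    cur = frame[: rows * patch, : cols * patch].astype(np.float32)
    prev = prev_frame[: rows * patch, : cols * patch].astype(np.float32)
    diff = np.abs(cur - prev)
    if diff.ndim == 3:  # average over colour channels if present
        diff = diff.mean(axis=2)
    # Regroup pixels as (rows, patch, cols, patch) and average within each patch.
    per_patch = diff.reshape(rows, patch, cols, patch).mean(axis=(1, 3))
    return per_patch > threshold


def count_gated_tokens(frames, patch=16, threshold=2.0):
    """Count patches to tokenize: all of frame 0, then only changed patches."""
    h, w = frames[0].shape[:2]
    total = (h // patch) * (w // patch)  # the first frame is tokenized in full
    for prev_frame, frame in zip(frames, frames[1:]):
        total += int(gate_patches(prev_frame, frame, patch, threshold).sum())
    return total


# Synthetic clip: static grey background, one 16x16 block brightens each frame.
frames = [np.full((64, 64), 100, dtype=np.uint8) for _ in range(5)]
for t, f in enumerate(frames):
    f[0:16, 0:16] = 100 + 20 * t

dense_tokens = len(frames) * (64 // 16) ** 2  # every patch of every frame
gated_tokens = count_gated_tokens(frames)
print(dense_tokens, gated_tokens)  # prints "80 20"
```

On this synthetic clip, dense tokenization would process 80 patches, while the gated count is 20: the full first frame plus one changed patch in each of the four later frames. The saving depends entirely on how static the content is, so the numbers here say nothing about the results reported in the paper.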
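Abstract (reference translation): High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks rely mostly on low-frame-rate sampling, such as uniform sampling or keyframe selection, and discard dense temporal information. This compromise avoids the high cost of tokenizing every frame, which would otherwise cause redundant computation and linear token growth as video length increases. The trade-off works for slowly changing content but fails for tasks such as lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, because their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute; (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and show that GRT enables efficient, scalable high-FPS video understanding.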


Related papers