Fugu-MT paper translation (abstract): ResidualViT for Efficient Temporally Dense Video Encoding

Paper overview: ResidualViT for Efficient Temporally Dense Video Encoding

arxiv url: http://arxiv.org/abs/2509.13255v1
Date: Tue, 16 Sep 2025 17:12:23 GMT
Status: Translation complete
Last updated in system: 2025-09-17 17:50:53.188916
Title: ResidualViT for Efficient Temporally Dense Video Encoding
Title (reference translation): ResidualViT for Real-Time Video Encoding
Authors: Mattia Soldan, Fabian Caba Heilbron, Bernard Ghanem, Josef Sivic, Bryan Russell
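Abstract summary: We make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce ResidualViT, a vision transformer (ViT) architecture that leverages the temporal redundancy in videos. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model.
Reference score (independently computed measure of attention): 66.57779133786131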
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.
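Abstract (reference translation): Video understanding tasks such as natural language temporal video grounding, temporal activity localization, and audio description generation require "temporally dense" reasoning over frames sampled at high temporal resolution. Given these temporal resolution requirements, however, frame-level features for such tasks are expensive to compute. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, called ResidualViT, that exploits the temporal redundancy of videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that improves processing speed by selectively discarding temporally redundant information while reusing the weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy that approximates the frame-level features of the original foundation model. Finally, we evaluate the approach across four tasks and five datasets, in both zero-shot and fully supervised settings, and show significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster) while closely approximating the accuracy of the original foundation model.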
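
Illustrative sketch (not the authors' implementation): the abstract names two architectural ideas, a token reduction module that discards temporally redundant information and a learnable residual connection across consecutive frames. The toy NumPy example below shows one possible reading of those ideas. Everything in it is an assumption made for illustration: the function names, the rule "keep the tokens that changed most since the previous frame", the mean-pooling stand-in for a pretrained ViT, and the fixed residual matrix that a real system would learn.

```python
import numpy as np

def select_changed_tokens(prev_tokens, curr_tokens, keep_ratio):
    """Hypothetical token reduction: keep the tokens that changed most since the previous frame."""
    # Per-token change magnitude: L2 distance between matching patch tokens.
    change = np.linalg.norm(curr_tokens - prev_tokens, axis=1)
    n_keep = max(1, int(round(keep_ratio * len(curr_tokens))))
    # Indices of the n_keep largest changes, returned in original token order.
    keep_idx = np.sort(np.argsort(change)[-n_keep:])
    return curr_tokens[keep_idx], keep_idx

def toy_encoder(tokens):
    """Stand-in for a pretrained ViT (assumption): mean-pools tokens into one frame feature."""
    return tokens.mean(axis=0)

def residual_frame_feature(prev_feature, reduced_tokens, residual_weight):
    """Encode the reduced token set, then add a linear residual of the previous frame's feature."""
    return toy_encoder(reduced_tokens) + residual_weight @ prev_feature

rng = np.random.default_rng(0)
num_tokens, dim = 16, 8
frame_a = rng.normal(size=(num_tokens, dim))
frame_b = frame_a.copy()
frame_b[[3, 7]] += 1.0  # only two patches change between the two frames

# The first frame is encoded with all tokens.
feature_a = toy_encoder(frame_a)

# The second frame is encoded from 2 of its 16 tokens plus a residual term.
kept, kept_idx = select_changed_tokens(frame_a, frame_b, keep_ratio=0.125)
residual_weight = 0.5 * np.eye(dim)  # placeholder; a real model would learn this matrix
feature_b = residual_frame_feature(feature_a, kept, residual_weight)

print(kept_idx.tolist(), kept.shape, feature_b.shape)
```

In this toy setup the second frame is processed with 2 tokens instead of 16, which is where a compute saving would come from; the actual selection rule, residual form, and savings reported in the paper are not specified by the abstract beyond the figures it quotes.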

List of related papers