Fugu-MT Paper Translation (Summary): QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

Paper summary: QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

arxiv url: http://arxiv.org/abs/2505.16175v1
Date: Thu, 22 May 2025 03:26:50 GMT
Status: Translation complete
System update date: 2025-05-23 17:12:48.010626
Title: QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Title (reference translation): QuickVideo: Real-Time Video Understanding via System-Algorithm Co-Design
Authors: Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen
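Abstract summary: Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. We propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications.
Reference score (independently computed attention score): 54.38970077613728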
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves a 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
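Abstract (reference translation): Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, which results in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. QuickDecoder, a parallelized CPU-based video decoder, achieves a 2-3 times speedup by splitting videos into keyframe-aligned intervals that are processed concurrently. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.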
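
The keyframe-aligned splitting described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' QuickDecoder: the function names (`keyframe_aligned_intervals`, `decode_interval`, `decode_parallel`), the even partitioning of keyframes across workers, and the placeholder decode step are all assumptions made for illustration. It only shows the general idea that each worker starts on a keyframe and can therefore decode its interval independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def keyframe_aligned_intervals(keyframes, duration, num_workers):
    """Split the span from the first keyframe to `duration` into at most
    `num_workers` contiguous intervals, each starting on a keyframe.

    `keyframes` is a sorted list of keyframe timestamps in seconds.
    """
    if not keyframes:
        return []
    # Keyframes per chunk, rounded up, so we never exceed num_workers chunks.
    per_chunk = -(-len(keyframes) // num_workers)
    starts = keyframes[::per_chunk]
    # Each interval ends where the next one starts; the last ends at duration.
    ends = starts[1:] + [duration]
    return list(zip(starts, ends))


def decode_interval(interval):
    # Hypothetical placeholder: a real worker would seek to `start` and
    # decode frames until `end`. Here we only report the assigned interval.
    start, end = interval
    return {"start": start, "end": end}


def decode_parallel(keyframes, duration, num_workers=4):
    intervals = keyframe_aligned_intervals(keyframes, duration, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so results stay in playback order.
        return list(pool.map(decode_interval, intervals))


if __name__ == "__main__":
    print(decode_parallel([0.0, 2.0, 4.0, 6.0, 8.0], duration=10.0, num_workers=2))
```

Threads are used here only to keep the sketch self-contained; whether a real decoder benefits from threads or needs separate processes depends on the decoding library, which this sketch does not assume.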


Related papers