Fugu-MT Paper Translation (Abstract): Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

Paper overview: Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

arxiv url: http://arxiv.org/abs/2509.08016v1
Date: Tue, 09 Sep 2025 00:55:04 GMT
Status: Translation complete
Last updated in system: 2025-09-11 17:24:19.820777
Title: Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Title (reference translation): Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Authors: Hyungjin Chung, Hyelin Nam, Jiyeon Kim, Hyojun Go, Byeongjun Park, Junho Kim, Joonseok Lee, Seongsu Ha, Byung-Hoon Kim
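Abstract summary: Video Parallel Scaling (VPS) is an inference-time method that expands a model's perceptual bandwidth without enlarging its context window. VPS runs multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By leveraging uncorrelated visual evidence, the method is shown to effectively contract the Chinchilla scaling law.
Reference score (independently computed attention measure): 47.42197619278693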
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
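Abstract (reference translation): Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail makes computation prohibitively expensive and degrades performance because of the long context. This paper proposes Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS runs multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates richer visual information than a single pass can. The paper shows theoretically that, by leveraging uncorrelated visual evidence, this approach effectively contracts the Chinchilla scaling law and so improves performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion show that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.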
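The mechanism described in the abstract (disjoint frame subsets, one inference stream per subset, aggregation of output probabilities) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how frames are assigned to streams or how probabilities are combined, so the interleaved uniform sampling and the unweighted average below are assumptions. The names `disjoint_frame_subsets`, `aggregate_probabilities`, `vps_next_token_probs`, and the `score_fn` callback (a stand-in for one VideoLLM forward pass returning next-token logits) are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def disjoint_frame_subsets(num_frames: int, num_streams: int,
                           frames_per_stream: int) -> List[List[int]]:
    """Pick frame indices uniformly over the video and deal them
    round-robin, so every stream gets a disjoint, interleaved subset.
    (The sampling scheme is an assumption, not taken from the paper.)"""
    total = num_streams * frames_per_stream
    if total > num_frames:
        raise ValueError("not enough frames for disjoint subsets")
    # With total <= num_frames these indices are strictly increasing.
    picked = [(i * num_frames) // total for i in range(total)]
    return [picked[s::num_streams] for s in range(num_streams)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def aggregate_probabilities(
        stream_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-stream distributions with uniform weights
    (the weighting is an assumption)."""
    n = len(stream_probs)
    return [sum(column) / n for column in zip(*stream_probs)]


def vps_next_token_probs(score_fn: Callable[[List[int]], Sequence[float]],
                         num_frames: int, num_streams: int,
                         frames_per_stream: int) -> List[float]:
    """Run one (hypothetical) forward pass per frame subset and
    aggregate the resulting next-token distributions."""
    subsets = disjoint_frame_subsets(num_frames, num_streams,
                                     frames_per_stream)
    return aggregate_probabilities([softmax(score_fn(s)) for s in subsets])


if __name__ == "__main__":
    # Toy stand-in for a model: logits depend only on the frames seen.
    def toy_score_fn(frames: List[int]) -> List[float]:
        return [sum(frames) / 10.0, 1.0, 0.0]

    print(disjoint_frame_subsets(16, 2, 4))
    print(vps_next_token_probs(toy_score_fn, 16, 2, 4))
```

Each stream sees only `frames_per_stream` frames, so the per-stream context length is unchanged while the number of distinct frames covered grows with `num_streams`. In a real decoder the aggregation would be repeated at every generation step.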

Related papers