Fugu-MT Paper Translation (Abstract): How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding

Paper overview: How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding

arxiv url: http://arxiv.org/abs/2606.20726v1
Date: Wed, 17 Jun 2026 03:30:01 GMT
Status: Translation complete
Last updated in system: 2026-06-26 13:30:50.695602
Title: How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding
Title (reference translation): What Is Your Video Model Like? Measuring Memory-Consumption Trade-offs in Long Video Understanding
Authors: Yixian Tian,
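Abstract summary: This paper proposes a compact empirical model that quantifies how answer accuracy degrades as a function of frame budget B and temporal distance D in long video understanding. Long-form models operate under strict budgets, yet no prior framework predicts how accuracy degrades as B shrinks and events recede.
Reference score (independently computed attention score): 0.0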
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce a compact empirical model that quantifies how answer accuracy degrades as a function of frame budget B and temporal distance D in long video understanding -- analyzing performance when recalling content from D seconds in the past using a fraction B of total frames. Long-form models operate under strict budgets, yet no prior framework predicts how accuracy degrades as B shrinks and events recede. We fit a weighted least-squares model on ~155,000 binary predictions across ten models and three sampling strategies, deriving a law where logit-accuracy scales linearly in log-budget with a distance-dependent exponent that decays log-linearly with distance. This budget exponent α(D) captures the marginal value of extra frames at distance D. The law achieves cell-level weighted R^2 = 0.05-0.75 across models. Notably, budget effectiveness at D = 1000 s differs by ≈ 7.4× between the best streaming and base models. STREAMINGVLM achieves α(1000) = 1.26 (95% CI: [1.06, 1.58]), meaning a tenfold budget increase substantially improves long-distance accuracy, while the best Qwen3-VL base model reaches only α(1000) = 0.17 (CI: [0.04, 0.34]). In accuracy space, a 10× budget increase at D = 1000 s yields +29 percentage points for STREAMINGVLM versus +4 pp for the base model. Sampling strategies show model-dependent trade-offs: random sampling yields higher base sensitivity but steeper distance decay. We demonstrate how α(D) enables principled budget allocation, including a model-ranking reversal at long distance, and propose it as a diagnostic metric for streaming video models.
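Abstract (reference translation): This paper proposes a compact empirical model that quantifies how answer accuracy degrades as a function of frame budget B and temporal distance D in long video understanding, analyzing performance when recalling content from D seconds in the past using a fraction B of the total frames. Long-form models operate under strict budgets, yet no prior framework predicts how accuracy degrades as B shrinks and events recede. A weighted least-squares model is fitted to about 155,000 binary predictions across ten models and three sampling strategies, yielding a law in which logit-accuracy scales linearly in log-budget, with a distance-dependent exponent that decays log-linearly with distance. This budget exponent α(D) captures the marginal value of extra frames at distance D. The law achieves cell-level weighted R^2 = 0.05-0.75 across models. In particular, budget effectiveness at D = 1000 s differs by ≈ 7.4× between the best streaming model and the best base model. STREAMINGVLM achieves α(1000) = 1.26 (95% CI: [1.06, 1.58]), meaning a tenfold budget increase substantially improves long-distance accuracy, while the best Qwen3-VL base model reaches only α(1000) = 0.17 (CI: [0.04, 0.34]). In accuracy space, a 10× budget increase at D = 1000 s yields +29 percentage points for STREAMINGVLM versus +4 pp for the base model. Sampling strategies show model-dependent trade-offs: random sampling yields higher base sensitivity but steeper distance decay. The paper shows how α(D) enables principled budget allocation, including a model-ranking reversal at long distance, and proposes it as a diagnostic metric for streaming video models.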
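The relation between the budget exponent and the accuracy gains quoted in the abstract can be illustrated with a short calculation. The sketch below is not the authors' code. It assumes the law has the form logit(accuracy) = c(D) + α(D) · log10(B), so a tenfold budget increase shifts logit-accuracy by α(D). The base-10 logarithm, the helper names, and the baseline accuracy of 0.5 are illustrative assumptions; only the two α(1000) values come from the abstract.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def gain_pp(p0: float, alpha: float, budget_factor: float = 10.0) -> float:
    """Accuracy change, in percentage points, when the frame budget is
    multiplied by budget_factor.

    ASSUMPTION: logit(acc) = c(D) + alpha(D) * log10(B). The abstract
    does not state the log base or the baseline accuracy p0.
    """
    new_logit = logit(p0) + alpha * math.log10(budget_factor)
    return 100.0 * (sigmoid(new_logit) - p0)


# alpha(1000) values reported in the abstract.
ALPHA_STREAMING = 1.26  # STREAMINGVLM
ALPHA_BASE = 0.17       # best Qwen3-VL base model

if __name__ == "__main__":
    # Ratio of budget exponents at D = 1000 s.
    print(f"exponent ratio: {ALPHA_STREAMING / ALPHA_BASE:.2f}")
    # Gains from a 10x budget increase at a hypothetical baseline of 0.5.
    for name, alpha in (("streaming", ALPHA_STREAMING), ("base", ALPHA_BASE)):
        print(f"{name}: {gain_pp(0.5, alpha):+.1f} pp")
```

Under these assumptions the exponent ratio is about 7.4, and at a baseline of 0.5 the gains are roughly +28 pp and +4 pp, the same order as the +29 pp and +4 pp reported. The exact figures depend on the baseline accuracy, which the abstract does not give.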

Related papers