Fugu-MT Paper Translation (Abstract): WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

Paper abstract: WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

arxiv url: http://arxiv.org/abs/2602.22142v1
Date: Wed, 25 Feb 2026 17:45:45 GMT
Status: Translation complete
Last updated in system: 2026-02-26 18:19:16.93326
Title: WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
Title (reference translation): WeaveTime: A Stream from Earlier Frames into the Emergent Memory of Video-LLMs
Authors: Yulin Zhang, Cheng Shi, Sibei Yang
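Abstract summary: WeaveTime is a simple, efficient, and model-agnostic framework that first teaches order and then uses order. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints.
Reference score (independently computed attention score): 37.61875409530676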
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective (our Streaming Order Perception enhancement) that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into an existing Video-LLM without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/
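Abstract (reference translation): Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, but their quadratic attention and offline training protocols make them ill-suited to streaming settings, where frames arrive sequentially and future observations are unavailable. We diagnose a core limitation of current Video-LLMs, Time-Agnosticism: videos are treated as an unordered bag of evidence rather than a causally ordered sequence, which leads to two failures in streams. WeaveTime is a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective (our Streaming Order Perception enhancement) that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into an existing Video-LLM without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/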
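
Illustration (not from the paper): the abstract describes uncertainty-triggered, coarse-to-fine retrieval only at a high level, so the following is a minimal sketch of that general idea under stated assumptions, not the authors' implementation. It assumes uncertainty is measured as the entropy of the model's answer distribution, that frames are plain feature vectors scored by dot product against a query vector, and that history is grouped into fixed-size chunks. The function name `select_context` and all parameters (`threshold`, `chunk_size`, `top_chunks`, `top_frames`) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_context(history, recent, query, answer_probs,
                   threshold=0.5, chunk_size=4, top_chunks=1, top_frames=2):
    """Pick the frame features to attend to (hypothetical helper).

    history and recent are lists of feature vectors in arrival order;
    only frames that have already arrived are ever considered.
    """
    # Confident answer (low entropy): stay on the current frames only.
    if entropy(answer_probs) <= threshold or not history:
        return list(recent)

    # Coarse step: rank fixed-size chunks of history by their mean feature.
    chunks = [(start, history[start:start + chunk_size])
              for start in range(0, len(history), chunk_size)]
    ranked = sorted(chunks, key=lambda c: dot(mean_vector(c[1]), query),
                    reverse=True)[:top_chunks]

    # Fine step: rank individual frames inside the selected chunks.
    candidates = [(start + i, frame)
                  for start, frames in ranked
                  for i, frame in enumerate(frames)]
    best = sorted(candidates, key=lambda c: dot(c[1], query),
                  reverse=True)[:top_frames]
    best.sort(key=lambda c: c[0])  # restore chronological order

    return [frame for _, frame in best] + list(recent)
```

In this sketch, history is consulted only when the answer distribution is uncertain, which is one way to read "expanding history only when needed"; the retrieved frames are returned in chronological order ahead of the current frames.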

