Fugu-MT Paper Translation (Summary): VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Paper summary: VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

arxiv url: http://arxiv.org/abs/2603.17948v1
Date: Wed, 18 Mar 2026 17:20:19 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.84965
Title: VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
Title (reference translation): VideoAtlas: Navigating long videos with logarithmic compute
Authors: Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan
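Abstract summary: VideoAtlas is a task-agnostic environment for representing video as a hierarchical grid. The hierarchical structure ensures that access depth grows only logarithmically with video length. When scaling from 1-hour to 10-hour benchmarks, Video-RLM is the most duration-robust method, with minimal accuracy degradation.
Reference score (independently computed attention level): 3.705718227493618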
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to visual domain requires a structured environment to recurse into, which VideoAtlas provides. VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse. (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter. (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
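Abstract (reference translation): Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. This paper therefore introduces VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid. An overview of the video is available at a glance, any region can be zoomed into recursively, and the same visual representation is used uniformly for the video, intermediate investigations, and the agent's memory, which eliminates lossy text conversion end to end. This hierarchical structure ensures that access depth grows only logarithmically with video length. For long context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. VideoAtlas as a Markov Decision Process unlocks Video-RLM. The paper reports three key results: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM is the most duration-robust method, with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.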
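
The claim that access depth grows only logarithmically with video length can be illustrated with a small sketch. This is not the authors' implementation: the function name `access_depth`, the 64-cell grid (for example 8x8), and the 1 frame-per-second sampling rate are all assumptions chosen for illustration. The sketch counts how many zoom steps are needed before a single grid cell covers a single frame, given that each zoom splits the current span evenly across the grid's cells.

```python
def access_depth(num_frames: int, cells_per_grid: int) -> int:
    """Count the zoom steps needed until one grid cell covers one frame.

    Hypothetical model: each zoom divides the current span of frames
    evenly across `cells_per_grid` cells (an assumed parameter, not a
    value taken from the paper).
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if cells_per_grid < 2:
        raise ValueError("cells_per_grid must be at least 2")

    depth = 0
    span = num_frames  # frames covered by a single cell's parent view
    while span > 1:
        # Ceiling division: frames covered by one cell after zooming in.
        span = -(-span // cells_per_grid)
        depth += 1
    return depth


if __name__ == "__main__":
    # Assumed sampling rate of 1 frame per second.
    for hours in (1, 10):
        frames = hours * 3600
        print(hours, "h ->", access_depth(frames, cells_per_grid=64), "levels")
```

Under these assumptions, a 1-hour video (3,600 frames) needs 2 levels and a 10-hour video (36,000 frames) needs 3, so a tenfold increase in duration adds a single zoom level. A larger grid lowers the depth further at the cost of smaller cells per view.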


Related papers