Fugu-MT Paper Translation (Abstract): RELIC: Interactive Video World Model with Long-Horizon Memory

Paper summary: RELIC: Interactive Video World Model with Long-Horizon Memory

arxiv url: http://arxiv.org/abs/2512.04040v1
Date: Wed, 03 Dec 2025 18:29:20 GMT
Status: Translation complete
Last updated in system: 2025-12-04 20:02:55.422022
Title: RELIC: Interactive Video World Model with Long-Horizon Memory
Title (reference translation): RELIC: Interactive Video World Model Using Long-Horizon Memory
Authors: Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan
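Abstract summary: A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles all three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.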
Reference score (independently computed attention score): 74.81433479334821
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
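Abstract (reference translation): A truly interactive world model requires three key elements: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these in isolation, because achieving all three at once is very difficult; for example, long-term memory mechanisms often degrade real-time performance. This paper introduces RELIC, a unified framework that addresses all three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Building on recent autoregressive video-diffusion distillation techniques, the model represents long-horizon memory with highly compressed historical latent tokens, encoded with both relative actions and absolute camera poses, inside the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, a bidirectional teacher video model is fine-tuned to generate sequences beyond its original 5-second training horizon and is then converted into a causal student generator through a new memory-efficient self-forcing paradigm, which enables full-context distillation over a long-duration teacher and long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated dataset rendered in Unreal Engine, RELIC generates in real time at 16 FPS while showing more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval than prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.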

Related papers