Fugu-MT Paper Translation (Abstract): Can World Models Benefit VLMs for World Dynamics?

Paper abstract: Can World Models Benefit VLMs for World Dynamics?

arxiv url: http://arxiv.org/abs/2510.00855v1
Date: Wed, 01 Oct 2025 13:07:05 GMT
Status: Translation complete
System update date: 2025-10-03 16:59:20.568626
Title: Can World Models Benefit VLMs for World Dynamics?
Title (reference translation): Can world models benefit VLMs in understanding world dynamics?
Authors: Kevin Zhang, Kuangzhi Ge, Xiaowei Chi, Renrui Zhang, Shaojun Shi, Zhen Dong, Sirui Han, Shanghang Zhang
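Abstract summary: This work investigates the capabilities that emerge when world model priors are transferred into Vision-Language Models. The best-performing variant is named Dynamic Vision Aligner (DyVA). DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.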
Reference score (independently computed attention score): 59.73433292793044
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundation models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we investigate the capabilities that emerge when world model priors are transferred into Vision-Language Models: we repurpose a video diffusion model as a generative encoder that performs a single denoising step, and we treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders can capture latents that are useful for downstream understanding and that differ from those of conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to the motion-consistency internalization that WorldLMs inherit from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.
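Abstract (reference translation): Generative world models trained on internet-scale video data are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. With the advent of strong video foundation models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? Recent studies have begun to explore the potential of world models on common vision tasks, but these explorations typically lack a systematic investigation of generic multimodal tasks. In this work, we repurpose a video diffusion model as a generative encoder that performs a single denoising step, and we treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we call World-Language Models (WorldLMs), and find that generative encoders can capture latents that are useful for downstream understanding and that differ from those of conventional encoders. With our best-performing variant, named Dynamic Vision Aligner (DyVA), we further find that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. By curating a suite of visual reasoning tasks, we find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to the motion-consistency internalization that WorldLMs inherit from video pre-training. Finally, we systematically explore a wide range of model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.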

Related papers