Fugu-MT Paper Translation (Abstract): VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

Paper summary: VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

arxiv url: http://arxiv.org/abs/2603.25420v1
Date: Thu, 26 Mar 2026 13:14:13 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.310248
Title: VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
Title (reference translation): VideoWeaver: Multimodal Multi-View Video Transfer
Authors: George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Yang Bai, Liudi Yang, Ziyuan Liu
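Abstract summary: VideoWeaver is the first multimodal multi-view V2V translation framework. Views are trained at distinct diffusion timesteps, which lets the model learn both joint and conditional view distributions. Experiments show performance superior or similar to the state of the art on single-view translation benchmarks.
Reference score (independently computed attention measure): 17.66237759970927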
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.
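Abstract (reference translation): Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferred to new environments without additional data collection. However, prior work can operate on only one view at a time, whereas embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Applying single-view models independently to each camera produces inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings because of the quadratic cost of cross-view attention. This paper introduces VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To extend it to the multi-view regime, the authors propose grounding all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, views are trained at distinct diffusion timesteps, which lets the model learn both joint and conditional view distributions. This in turn enables autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show performance superior or similar to the state of the art on single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups that are central to world randomization for robot learning.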
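
The idea of training views at distinct diffusion timesteps can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the noise schedule, the sampling distribution, or how clean views are chosen. The function names `sample_view_timesteps` and `noise_views`, the parameter `p_clean`, and the linear interpolation toward Gaussian noise (a common choice in flow-based models) are all assumptions made for illustration. A view drawn at t = 0 stays clean and can act as a condition for the others, which is one plausible way a single model could see both joint and conditional view distributions during training.

```python
import random


def sample_view_timesteps(num_views, p_clean=0.3, rng=random):
    """Draw one timestep in [0, 1] per view (hypothetical scheme).

    With probability p_clean a view gets t = 0, meaning it stays clean
    and can serve as conditioning for the remaining views.
    """
    timesteps = []
    for _ in range(num_views):
        if rng.random() < p_clean:
            timesteps.append(0.0)
        else:
            timesteps.append(rng.random())
    return timesteps


def noise_views(latents, timesteps, rng=random):
    """Interpolate each view's latent toward Gaussian noise by its own t.

    Assumed convention: x_t = (1 - t) * x_0 + t * noise, so t = 0 is the
    clean latent and t = 1 is pure noise.
    """
    noisy = []
    for latent, t in zip(latents, timesteps):
        noisy.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent])
    return noisy


rng = random.Random(0)
# Toy stand-ins for per-view latents (three cameras, three values each).
latents = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
timesteps = sample_view_timesteps(len(latents), rng=rng)
noisy = noise_views(latents, timesteps, rng=rng)
```

Under this assumed scheme, fixing the already generated views at t = 0 and sampling only a new view from noise would correspond to the autoregressive synthesis of new viewpoints that the abstract describes.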


Related papers