Fugu-MT Paper Translation (Abstract): Stereo World Model: Camera-Guided Stereo Video Generation

Paper overview: Stereo World Model: Camera-Guided Stereo Video Generation

arxiv url: http://arxiv.org/abs/2603.17375v1
Date: Wed, 18 Mar 2026 05:42:22 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.524831
Title: Stereo World Model: Camera-Guided Stereo Video Generation
Title (reference translation): Stereo World Model: Camera-Guided Stereo Video Generation
Authors: Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi
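Abstract summary: This paper introduces StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality.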
Reference score (independently computed attention score): 52.3922115596956
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
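Abstract (reference translation): We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry; unlike monocular RGB or RGBD approaches, it operates exclusively within the RGB modality while grounding geometry directly from disparity. To achieve consistent stereo generation efficiently, we propose two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, giving relative, view- and time-consistent conditioning while preserving pretrained video priors; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, generating more than 3x faster with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.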
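
Illustrative sketch (not the authors' code): the abstract describes factoring full 4D attention over stereo video tokens into 3D intra-view attention plus horizontal row attention. The NumPy example below is one possible reading of that idea, written only to make the token grouping concrete. Everything in it is an assumption: the latent layout (V, T, H, W, C), the function names, the use of identity Q/K/V projections, and the choice to apply the two attention stages one after the other. The paper may combine them differently.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Scaled dot-product attention; tokens live on the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def intra_view_attention(x):
    """3D attention: each view attends over its own T*H*W tokens only.

    x has the assumed layout (V, T, H, W, C) with V = 2 views.
    """
    V, T, H, W, C = x.shape
    tokens = x.reshape(V, T * H * W, C)
    return softmax_attention(tokens, tokens, tokens).reshape(x.shape)


def row_attention(x):
    """Horizontal row attention: tokens in the same frame and image row
    attend to each other across both views (V*W tokens per group).

    For rectified stereo, corresponding points share a row, which is the
    epipolar prior the abstract refers to.
    """
    V, T, H, W, C = x.shape
    rows = x.transpose(1, 2, 0, 3, 4).reshape(T, H, V * W, C)
    out = softmax_attention(rows, rows, rows)
    return out.reshape(T, H, V, W, C).transpose(2, 0, 1, 3, 4)


def decomposed_stereo_attention(x):
    """Assumed composition: intra-view attention followed by row attention."""
    return row_attention(intra_view_attention(x))


def attention_pair_counts(V, T, H, W):
    """Number of query-key pairs for full 4D attention vs. the decomposition."""
    full = (V * T * H * W) ** 2
    decomposed = V * (T * H * W) ** 2 + T * H * (V * W) ** 2
    return full, decomposed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.standard_normal((2, 4, 8, 8, 16))  # toy stereo latent grid
    out = decomposed_stereo_attention(latents)
    full, decomposed = attention_pair_counts(2, 4, 8, 8)
    print(out.shape, full, decomposed)
```

In this toy setting the row stage adds only a small number of query-key pairs on top of the per-view attention, because each group contains just the V*W tokens of one row. A real model would add learned projections, multiple heads, and residual connections, none of which are shown here.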

Related papers