CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
- URL: http://arxiv.org/abs/2603.04291v1
- Date: Wed, 04 Mar 2026 17:06:56 GMT
- Title: CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
- Authors: Lingen Li, Guangzhi Wang, Xiaoyu Li, Zhaoyang Zhang, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan,
- Abstract summary: We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned order. Experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality.
- Score: 86.80231588752957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting $\leq$ 1K resolution native generation and relying on suboptimal post super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg-li.github.io/project/cubecomposer
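The abstract's key idea is to decompose each 360° frame into six cubemap faces so that each face can be generated at high resolution independently, then stitched back onto the sphere. As an illustrative sketch only (the paper's own projection conventions and code are not given here), the core cubemap-to-sphere mapping can be written as follows; the face naming and axis conventions below are one common choice, not necessarily the authors':

```python
import math

# Map a sample point on a cube face to equirectangular (longitude, latitude).
# Face-local coordinates (u, v) run in [-1, 1]; "front" is taken as +Z.
# Each lambda returns the 3D direction vector for that face.
FACE_DIRS = {
    "front": lambda u, v: (u, -v, 1.0),
    "back":  lambda u, v: (-u, -v, -1.0),
    "right": lambda u, v: (1.0, -v, -u),
    "left":  lambda u, v: (-1.0, -v, u),
    "up":    lambda u, v: (u, 1.0, v),
    "down":  lambda u, v: (u, -1.0, -v),
}

def face_to_lonlat(face, u, v):
    """Return (longitude, latitude) in radians for a cube-face sample."""
    x, y, z = FACE_DIRS[face](u, v)
    lon = math.atan2(x, z)                            # longitude in (-pi, pi]
    lat = math.asin(y / math.sqrt(x*x + y*y + z*z))   # latitude in [-pi/2, pi/2]
    return lon, lat
```

Inverting this mapping per pixel gives the sampling grid that converts an equirectangular panorama into six face images (and back), which is what makes per-face generation, and hence native 4K output, tractable; the paper's cube-aware positional encoding, padding, and blending then handle continuity across the face seams.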
Related papers
- Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering [15.79758281898629]
Generative models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This paper explores a new strategy for camera-conditioned video generation of static scenes. Our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency.
arXiv Detail & Related papers (2026-01-14T18:50:06Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - Mavors: Multi-granularity Video Representation for Multimodal Large Language Model [39.24524388617938]
$\mathbf{Mavors}$ is a novel framework for holistic long-video modeling. Mavors encodes raw video content into latent representations through two core components. The framework unifies image and video understanding by treating images as single-frame videos.
arXiv Detail & Related papers (2025-04-14T10:14:44Z) - CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation [59.257513664564996]
We introduce a novel method for generating 360° panoramas from text prompts or images. We employ multi-view diffusion models to jointly synthesize the six faces of a cubemap. Our model allows for fine-grained text control, generates high-resolution panorama images, and generalizes well beyond its training set.
arXiv Detail & Related papers (2025-01-28T18:59:49Z) - VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models [58.464465016269614]
We propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Our approach delivers HD-resolution reconstructions in under 6 seconds per frame on a single NVIDIA 4090 GPU.
arXiv Detail & Related papers (2024-11-29T08:10:49Z) - MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing. First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder. Second, we present MotionAura, a text-to-video generation framework. Third, we propose a spectral transformer-based denoising network. Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z) - Pyramidal Flow Matching for Efficient Video Generative Modeling [67.03504440964564]
This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution. The entire framework can be optimized in an end-to-end manner with a single unified Diffusion Transformer (DiT).
arXiv Detail & Related papers (2024-10-08T12:10:37Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.